fw/1 - Cannot display foreign characters in views - Lucee

I have the following main controller default action in FW/1 (Framework One 4.2), where I define some rc scope variables to be displayed in views/main/default.cfm.
main.cfc
function default( struct rc ) {
    rc.name = "ΡΨΩΓΔ";
    rc.dupl_name = "üößä";
}
views/main/default.cfm
<cfoutput>
<p> #rc.name# </p>
<p> #rc.dupl_name# </p>
</cfoutput>
and finally in layouts/default.cfm
<cfoutput><!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>
A Simple FW/1 Layout Template
</title>
</head>
<body>
#body#
</body>
</html>
</cfoutput>
Unfortunately, I receive the following output:
ΡΨΩΓΔ
üößä
Any ideas that could help me?
Regards

Because I already gave you a solution in my comments, I'm posting it as an answer here, so that others may find the root of similar charset issues.
The issue described above is very often the result of conflicting charset encodings: for example, reading an ISO-8859-1 encoded file and outputting it as UTF-8 without converting (decoding/encoding) it properly.
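To see the mismatch at the byte level, here is a minimal Node.js sketch (Node is used only for illustration; the same applies to any engine):
// 'ü' stored as UTF-8 is the two bytes C3 BC; decoding those bytes
// with a single-byte charset produces the garbled output shown above
var utf8Bytes = Buffer.from('ü', 'utf8'); // <Buffer c3 bc>
console.log(utf8Bytes.toString('latin1')); // prints 'Ã¼'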
For sending characters through a web app as you are doing, the most common charset encoding today is UTF-8.
All you need to do is harmonize character set encodings, or at least ensure that the encodings are set so that the processing engine can encode and decode the charsets correctly. What you can do in your case is:
Verify the encoding of the template .cfm file and make sure it’s saved as UTF-8
Define UTF-8 as your web charset in Lucee Administrator » Settings » Charset » Web charset
Define UTF-8 as your resource charset in Lucee Administrator » Settings » Charset » Resource charset.
Make sure there are no other places where charset encodings are incorrectly set, such as an HTTP server response header of "Content-Type: text/html;charset..." or an HTML meta tag with a charset directive.
If you have other types of resources, such as a database, you may need to check those as well (connection and database/table settings).
Note: the charset doesn't always have to be UTF-8 (UTF-8 uses multiple bytes for certain characters). There are use cases where you would achieve the same result with a single-byte encoding such as ISO-8859-1, which also needs less computing resources and a smaller payload.
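For instance, in Node.js you can see the size difference directly (an illustrative sketch):
// the sample string from the question: 4 characters
console.log(Buffer.byteLength('üößä', 'utf8')); // 8 (two bytes per character in UTF-8)
console.log(Buffer.byteLength('üößä', 'latin1')); // 4 (one byte per character in ISO-8859-1)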
I hope this may help others with similar issues.

Related

Haskell and the web (UTF-8)

I have a Haskell script that putStrs some output, and I want to display that output on a web page. I compiled the Haskell script and run it as a CGI script, which works well.
But when there are special characters in the putStr output, it fails:
Terminal-Output:
<p class="text">what happrns at ä ?</p>
Web output:
what happrns at
And nothing after it is displayed... what happened?
Some questions:
What Content-type header is your CGI script sending?
Are you setting the encoding on the stdout handle?
You should always send a Content-type header, and in that header you can stipulate the character encoding. See this page for more details.
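For a CGI response, that header is simply the first line your program prints, followed by a blank line, e.g.:
Content-Type: text/html; charset=utf-8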
Also, the encoding for Haskell file handles is determined by your OS's current default. It might be UTF-8, but it also might be something else. See this SO answer for more info.
So to make sure that the characters you send appear correctly on a web page, I would:
Send a Content-type header with charset=utf8
Make sure the encoding for stdout is also set to utf8:
import System.IO
...
-- force UTF-8 output regardless of the OS locale default
hSetEncoding stdout utf8
If the Content-type header and your output encoding agree, things should work out.
Assuming that the Haskell program is properly outputting utf-8, you will still need to let the browser know what encoding you are using. In HTML 4, you can do this as follows
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
(add this inside the <head> tag).
In HTML 5 you can use
<meta charset="UTF-8">
You can verify that the Haskell program is outputting the correct format using xxd on the command line
./myProgram | xxd
This will show you the actual output in bytes, so that you can verify the encoding character by character.

Why do we need metadata specifying the encoding?

This feels like a chicken-and-egg problem: if I write an HTML meta tag specifying the charset as, say, UTF-16, how do we decode the HTTP request in the first place if we don't know it's UTF-16 data? I believe the request header needs to handle this, and by the time we try to read metadata like the HTML charset="utf-16" attribute, we already know it's UTF-16.
Besides, think one level higher about header information like request headers: are they passed in ASCII as a standard?
I mean, at some level we need an agreement; you can't store the data needed for decoding as metadata inside the data itself. Can anyone clarify this?
I am a bit confused by the idea of specifying the information needed to interpret the data as metadata inside that very data.
In general, how can any form of encoding work if we don't have a standard, agreed-upon encoding to convey the data about the data itself?
For example, I am informed that Apache defaults to ISO-8859-1. So would all clients need to assume that for HTTP headers, and then interpret the actual content as UTF-8 if we want UTF-8 for the Content-Type?
"What character encoding should I use for a HTTP header?" is a closely related question.
UTF-16 (and some other) encodings use a BOM (Byte Order Mark) that is read at the start of the file and signals which encoding is being used. Only after that does the encoded part of the file begin.
For example, for UTF-16, you'll have the bytes FE FF if big-endian and FF FE if little-endian words are being used.
You also often see UTF-8 BOMs, although they don't need to be used (and may confuse some XML parsers).
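As a sketch of how a consumer can use the BOM (a hypothetical helper, shown in Node.js):
// returns the encoding indicated by a leading BOM, or null if there is none
function sniffBom(buf) {
  if (buf[0] === 0xFE && buf[1] === 0xFF) return 'utf-16be';
  if (buf[0] === 0xFF && buf[1] === 0xFE) return 'utf-16le';
  if (buf[0] === 0xEF && buf[1] === 0xBB && buf[2] === 0xBF) return 'utf-8';
  return null; // no BOM: fall back to headers or an agreed default
}
console.log(sniffBom(Buffer.from([0xFF, 0xFE, 0x41, 0x00]))); // 'utf-16le'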

Nodejs convert string into UTF-8

From my DB I'm getting the following string:
Johan Öbert
What it should say is:
Johan Öbert
I've tried to convert it into utf-8 like so:
nameString.toString("utf8");
But still same problem.
Any ideas?
I'd recommend using the Buffer object:
var someEncodedString = Buffer.from('someString', 'utf-8').toString();
This avoids any unnecessary dependencies that other answers require, since Buffer is included with node.js, and is already defined in the global scope.
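For the specific symptom in the question ("Johan Öbert"), the string is most likely UTF-8 bytes that were decoded with a single-byte charset. A sketch of reversing that in plain Node.js (it works when every garbled character is in the Latin-1 range; characters like "–" come from a Windows-1252 decode and need a library such as iconv-lite instead):
// re-encode the mojibake as Latin-1 bytes, then decode those bytes as UTF-8
var fixed = Buffer.from('Ã¼', 'latin1').toString('utf8');
console.log(fixed); // prints 'ü'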
Use the utf8 module from npm to encode/decode the string.
Installation:
npm install utf8
In a browser:
<script src="utf8.js"></script>
In Node.js:
const utf8 = require('utf8');
API:
Encode:
utf8.encode(string)
Encodes any given JavaScript string (string) as UTF-8, and returns the UTF-8-encoded version of the string. It throws an error if the input string contains a non-scalar value, i.e. a lone surrogate. (If you need to be able to encode non-scalar values as well, use WTF-8 instead.)
// U+00A9 COPYRIGHT SIGN; see http://codepoints.net/U+00A9
utf8.encode('\xA9');
// → '\xC2\xA9'
// U+10001 LINEAR B SYLLABLE B038 E; see http://codepoints.net/U+10001
utf8.encode('\uD800\uDC01');
// → '\xF0\x90\x80\x81'
Decode:
utf8.decode(byteString)
Decodes any given UTF-8-encoded string (byteString) as UTF-8, and returns the UTF-8-decoded version of the string. It throws an error when malformed UTF-8 is detected. (If you need to be able to decode encoded non-scalar values as well, use WTF-8 instead.)
utf8.decode('\xC2\xA9');
// → '\xA9'
utf8.decode('\xF0\x90\x80\x81');
// → '\uD800\uDC01'
// → U+10001 LINEAR B SYLLABLE B038 E
I had the same problem when I loaded a text file via fs.readFile(). I tried to set the encoding to UTF-8, but it stayed the same. My solution now is this:
myString = JSON.parse( JSON.stringify( myString ) )
After this, an Ö is really interpreted as an Ö.
When you want to change the encoding, you always go from one into another: you might go from Mac Roman to UTF-8, or from ASCII to UTF-8.
Knowing the current source encoding is as important as knowing the desired output encoding. For example, if you have Mac Roman text and you decode it from UTF-16 to UTF-8, you'll just garble it.
If you want to know more about encoding this article goes into a lot of details:
What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text
The npm package encoding, which uses node-iconv or iconv-lite, should allow you to easily specify which source and output encoding you want:
var encoding = require('encoding');
var resultBuffer = encoding.convert(nameString, 'ASCII', 'UTF-8');
You should set the database connection's charset instead of fighting it inside Node.js:
SET NAMES 'utf8';
(works at least in MySQL and PostgreSQL)
Keep in mind that you need to run that for every connection. If you're using a connection pool, do it with an event handler, e.g.:
mysqlPool.on('connection', function (connection) {
  connection.query("SET NAMES 'utf8'");
});
https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html#charset-connection-client-configuration
https://www.postgresql.org/docs/current/multibyte.html#id-1.6.10.5.7
https://www.npmjs.com/package/mysql#connection
Just add <?xml version="1.0" encoding="UTF-8"?> and it will be encoded. For instance, an RSS feed can then contain any character after adding this:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">....
Also add <meta charset="utf-8" /> to your parent layout or main app.html:
<!DOCTYPE html>
<html lang="en" class="overflowhere">
<head>
<meta charset="utf-8" />
</head>
</html>

How to escape special characters from a markdown string?

I have a markdown file (UTF-8) that I am turning into an HTML file. My current setup is pretty straightforward (simplified):
var fs = require('fs');
var marked = require('marked');
var file = fs.readFileSync(site.postLocation + '/in.md', 'utf8');
var escaped = marked(file);
fs.writeFileSync('out.html', escaped);
This works great; however, I've now run into an issue where special characters in the markdown file (such as é) get messed up when viewed in a browser (é).
I've found a couple of npm modules that can convert HTML entities, but they all convert just about every convertible character, including those required by the markdown syntax (for example, '#' becomes '&num;' and '.' becomes '&period;'), and the markdown parser then fails.
I've tried the libs entities and node-iconv.
I imagine this is a pretty standard problem. How can I replace only the unusual letter characters without the symbols markdown requires?
As pointed out by hilarudeens, I forgot to include the meta charset HTML tag:
<meta charset="UTF-8" />
If you come across similar issues, I would suggest you check that first.
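If you really do need selective escaping rather than a charset fix, a hypothetical helper (not part of the original answer) could numerically escape only non-ASCII characters and leave the markdown syntax alone:
// escape only code points above 127; '#', '.', '*', etc. stay untouched
function escapeNonAscii(str) {
  return Array.from(str).map(function (ch) {
    return ch.codePointAt(0) > 127 ? '&#' + ch.codePointAt(0) + ';' : ch;
  }).join('');
}
console.log(escapeNonAscii('## café')); // '## caf&#233;'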

How to correctly interpret arguments with umlauts from a JSP page [duplicate]

This question already has answers here:
How to pass Unicode characters as JSP/Servlet request.getParameter?
(5 answers)
Closed 6 years ago.
I'm getting input from a text input field on a JSP page that can contain umlauts (e.g. Ä, Ö, Ü, ä, ö, ü, ß).
Processing the input works fine for non-umlaut characters, but whenever an umlaut is entered in the input field, an incorrect value gets passed on.
For example:
If an "ä" (UTF-8:U+00E4) gets entered in the input field,
the String that is extracted from the argument is "ä" (UTF-8: U+00C3 and U+00A4)
It seems to me as if the UTF-8 hex encoding (which is c3 a4 for an "ä") gets used for the conversion.
How can I retrieve the correct value?
Here are snippets from the current implementation:
The JSP-page passes the input value "pk" on to the processing logic:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
...
<input type="text" name="<%=pk.toString()%>" value="<%=value%>" size="70"/>
<button type="submit" title="Edit" value='Save' onclick="action.value='doSave';pk.value='<%=pk.toString()%>'"><img src="icons/run.png"/>Save</button>
The value gets retrieved from args and converted to a string:
UUID pk = UUID.fromString(args.get("pk")); //$NON-NLS-1$
String value = args.get(pk.toString());
Note: Umlaute that are saved in the Database get displayed correctly on the page.
"Note: Umlaute that are saved in the Database get displayed correctly on the page."
Given this fact, I'll assume that you've already properly set the response encoding to UTF-8, either by <%@page pageEncoding="UTF-8"%> in the JSP or by <jsp-config> in web.xml.
That leaves the request encoding. You weren't clear in your question or the code about whether you're using GET or POST. If you're using POST, you need to create a servlet filter which explicitly sets the HTTP request body encoding:
request.setCharacterEncoding("UTF-8");
Or, if you're using GET, you'd need to dig into the server configuration to set the URI/parameter encoding to UTF-8. How to do that depends on the server used, which is also not clear from the question, let alone from your question history. I'll therefore just give an example for Tomcat: set the URIEncoding attribute of the <Connector> element in Tomcat's /conf/server.xml:
<Connector ... URIEncoding="UTF-8">
See also:
Unicode - How to get the characters right?
