UTF-8 string not displaying character properly - python-3.x

I am returning text field from DB column:
"algún nombre de juego"
when I try to use this text in an html file, I get the following string:
algún nombre de juego
This happens consistently with any string that is not pure english. I have even tried grabbing simple text from online in different languages and as soon as I plug them into an html text box, the print out with improper encoding like the example above.
From reading Unicode documentation for python, UTF-8 should be able to handle pretty much any character in any language.
I have tried many ways of encoding/decoding, and either there is an encoding error, or I get back weird character where the letter u with acute should be.
Per comments, I am not using Django or Flask. Just taking strings from a DB that might be in several different languages and generating an HTML file for internal use.

A few ideas:
1. Remember to declare the encoding. Either:
Insert a <meta charset="utf-8"> just after your <head> tag.
OR
do this in the HTTP header (making it look something like Content-Type:text/html; charset=UTF-8). This can be done by changing the webserver's settings or using server-side code, should you choose to host the page.
2. Ensure that your IDE saves files as UTF-8. This will be somewhere in your settings and should be done automatically, but worth checking if option 1 doesn't work.
Hope one of these works.

Related

NodeJS Jade (Pug) inline link in dynamic text

I have this NodeJS application, that uses Jade as template language. On one particular page, one text block is retrieved from the server, which reads the text from database.
The problem is, the returned text might contain line-breaks and links, and an operator might change this text at any time. How do I make these elements display correctly?
Most answers suggest using a new line:
p
| this is the start of the para
a(href='http://example.com') a link
| and this is the rest of the paragraph
But I cannot do this, since I cannot know when the a element appears. I've solved how to get newline correct, by this trick:
p
each l in line.description.split(/\n/)
= l
br
But I cannot seem to solve how to get links to render correctly. Does anyone know?
Edit:
I am open to any kind of format for links in the database, whatever would solve the issue. For example, say database contains the following text:
Hello!
We would like you to visit [a("http://www.google.com")Google]
Then we would like that to output text that looks like this:
Hello!
We would like you to visit Google
Looks like what you're looking for is unescaped string interpolation. The link does not work in the output because Pug automatically escapes it. Wrap the content you want to insert with !{} and it should stop breaking links. (Disclaimer: Make sure you don't leave user input unescaped - this only is a viable option if you know for sure the content of your DB does not have unwanted HTML/JS code in it.)
See this CodePen for illustration.
With this approach, you would need to use standard HTML tags (<a>) in your DB text. If you don't want that, you could have a look at Pug filters such as markdown-it (you will still need to un-escape the compilation output of that filter).

Why does Git have a 'Check ✓' mark in its querystring?

I was searching about something on Git when I noticed that in the query string, there's a parameter utf8 and its value is ✓! Not utf8=yes or utf8=true.
The full URL is following in Chrome and Firefox:
https://github.com/Modernizr/Modernizr/search?utf8=✓&q=browser&type=
But following in IE:
https://github.com/Modernizr/Modernizr/search?utf8=%E2%9C%93&q=browser&type=
So it seems to be a way of detecting the encoding scheme for the URL, but does anyone know for sure? Also, aren't there any simpler ways of doing this?
From a StackExchange question:
By default, older versions of IE (<=8) will submit form data in Latin-1 encoding if possible. By including a character that can't be expressed in Latin-1, IE is forced to use UTF-8 encoding for its form submissions, which simplifies various backend processes, for example database persistence.
If the parameter was instead utf8=true then this wouldn't trigger the UTF-8 encoding in these browsers.
This is a hack/feature of Rails (which Github is built with) to force IE to submit UTF-8 text.
It's a feature of Rails (which GitHub is built with), although it's not specific to Rails.

Securing application against XSS

We are currently using OWASP Antisamy project to protect our application against XSS attacks. Every input field is sanitized when any given form is submitted to server. It works fine, but we have issues with the fields like company name, organization name, etc.
Ex: Ampersand is escaped for AT&T and the company name is displayed wrong (displayed with escaped characters).
We manually update the fields on database to fix this issue. However, this is a pain in the neck as you can imagine.
Is there a way to address this using OWASP antisamy or should we use a different library?
You should only encode on output, not on input. If a user enters AT&T in your application, this should be stored at AT&T in the database. There is no need to encode it, but of course make sure that you are using parameterised queries which will prevent characters such as ' from breaking out of the SQL command context and causing SQL injection.
When you output, this is the only time you need to encode characters. e.g. AT&T should be encoded as AT&T when output to HTML, so it is displayed in the browser as AT&T.
It seems like your application is encoding the input and also encoding the output, so strings like above will be double encoded at then output as AT&amp;T in your HTML, causing the problem. Remove your input encoding to solve this.
The reason you should only encode when output is that if you decide you want to output data to a different format such as JSON or JavaScript, then the encoding is different. O'leary would become O\x27leary if encoded properly for JavaScript, which would not display properly in HTML where the encoding is O'leary.

How to compare different language String values in JAVA?

In my web application I am using two different Languages namely English and Arabic.
I have a search box in my web application in which if we search by name or part of the name then it will retrieve the values from DB by comparing the "Hometown" of the user
Explanation:
Like if a user belongs to hometown "California" and he searches a name say "Victor" then my query will first see the people who are having the same hometown "California" and in the list of people who have "California" as hometown the "Victor" *name* will be searched and it retrieve the users having "California" as their hometown and "victor" in their name or part of the name.
The problem is if the hometown "California" is saved in English it will compare and retrieve the values. But "California" will be saved as "كاليفورنيا" in Arabic. In this case the hometown comparison fails and it cant retrieve the values.
I wish that my query should find both are same hometown and retrieve the values. Is it possible?
What alternate I should think of for this logic for comparison. I am confused. Any suggestion please?
EDIT:
*I have an Idea such that if the hometown is got then is it possible to use Google translator or transliterator and change the hometown to another language. if it is in english then to arabic or if it is in english then to arabic and give the search results joining both. Any suggestion?*
The problem you encounter is that you want / need information in 2 or more languages and you want the user of your application to be able to use both languages. One possible approach is to keep multiple records per item and including a language code as part of the primary key, for instance if your record is
id hometown name
001 California Victor
you could introduce a language code and store
id lang hometown name
001 en California Victor
001 ar كاليفورنيا Victor
then your search would match either "California" or "كاليفورنيا" giving you the id 001, which you can then use to load all translations of your data (or just the data in the current output language.) This sceme can be used with any number of languages and has the added advantage that you don't need to prefill the table. You can add new translations for records when they become known.
(Caveat: I just repeated your arabic string, I can't read it, also 'ar' most likely isn't the correct language code for aribic but you get the idea.)
Does the Arabic sound like "California"? If so you will need to compare on a "sounds-like"-basis which will most likely result in a phoneme conversion.
Transliterate all names into the same language (e.g. English) for searching, and use Levenstein edit distance to compute the similarity between the phonetic representations of the names. This will be slow if you simply compare your query with every name, but if you pre-index all of the place names in your database into a Burkhard-Keller tree, then they can be efficiently searched by edit distance from the query term.
This technique allows you to sort names by how close they actually match. You're probably more likely to find a match this way than using metaphone or double-metaphone, though this is more difficult to implement.
Your Google suggestion sounds like it might also be a good one, but you should play around with it, and be sure that you're happy with its accuracy. In testing how it worked going between Hebrew and English, I noticed that sometimes Google just leaves English place names in English letters when translating to Hebrew.
How about you use some localization on client side to display values. Or create a wrapper class for hometown that will override equal(Object) in the manner the instance for California will return true for both "California" and "كاليفورنيا" (sorry if I made mistake here, just copy-pasted from above).
This sounds like a classic encoding problem. Whenever you transfer non-ascii character you need to make sure you're encoding it right. For Arabic and English I suspect you can use UTF-8 (but I don't know arabic, so it may be wrong).
In your setup you will probably have the following points:
Browser <-> Servlet container <-> Database
|
System.out
In any of the system interfaces where chars (16-bit) are converted to byte (8-bit) you will need to make sure the encoding is correct.
Browser to Servlet container
When you do GET or POST requests from a web-page, the browser will look at 1) The HTTP headers from the server, especially the Content-Type: text/html; charset=UTF-8, which if present, will override the HTML meta header <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">.
On the servlet container side, the HttpServletRequest.getParameter(), will have an encoding that you most likely need to set in the server settings.
Example tomcat's server.xml
<Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"
maxThreads="2000"
connectionTimeout="20000"
redirectPort="8443" />
Servlet container to Database
The database needs to have the correct encodings, or sorting etc will not be right.
Example my.cnf for MySQL
[mysqld]
....
init_connect=''SET collation_connection = utf8_general_ci''
init_connect='SET NAMES utf8'
default-character-set=utf8
character-set-server = utf8
collation-server = utf8_general_ci
[mysql]
....
default-character-set=utf8
Then the JDBC-driver needs to be set for UTF-8.
Example JDBC connect string
jdbc:mysql://localhost:3306/rimario?useUnicode=true&characterEncoding=utf-8
System.out
System.out.printnln() can not be relied upon to verify things. First it depends on the java vm default encoding, set using System.property -Dfile.encoding=UTF-8, secondly the terminal in which you do the System.out, will need to be set to and support UTF-8. Don't trust System.out!
Once a String in the VM is a proper character, it will not be affected by encoding. In memory every char in a string is 16-bit, which (almost) covers all the chars that utf-8 can encode. You can write the string to a file and investigate the file to really know if you got correct chars in your VM.

Safe or unpractical to use UTF-8 page names or other text? - User submitted text!

I am working on a site that have an international aim; I.o.w., logged in users can add text in their own language. I am hoping for international page names and content.
An URL example, like the Japanese Wikipedia: http://ja.wikipedia.org/wiki/メインページ (Both pagename and content text).
I know by using UTF-8, I can do this, but how should I control it?
UTF-8 contains way to many languages/letters to control in a script, I guess, so how safe/unsafe is it to allow people to add UTF-8 text?
I can see that someone could add harmful code this way, but how to prevent it?
All information regarding safety/control when using UTF-8 is appreciated!
EDIT: PS! I use PHP and MySQL.
Warning: perhaps a slightly rusty response:
Note: not discussing host name (IDNS) issues.
The only completely safe thing here is to use %-escaped UTF-8. Some browsers will display this as what you want, and some will display the %-escapes. (e.g. http://foo.bar/%ee%cc%cf.html)
If you put 'real UTF-8' in the URLs, many things will work, but there may be unpleasant surprises lurking for some people in some browsers. I'm reading your question as dealing with 100% static content. If you are trying to do this with code behind the site, you have additional issues to work on.
The 'unpleasant surprises' would be (a) people finding the %xx's in the URL unreadable, (b) a browser that melts, (c) some data scraping or aggregating application that melts.
I wish I were more up to date on this, but I'm not, so my recommendation is to deploy a test site and then try to access it with everything you can put your hands on, including mobile phones. Persuade Google to index it, and see what happens there.
For domain names, this is called IDN. For page names, you may want to think of the possibility of IDN spoofs.
It's safe as long as you don't interpret it literally as SQL (SQL injection) or HTML (XSS) or any other language. Just escape any user-controlled input (request URL, request headers, request parameters, request body, etc..etc..) at the point it's going to be used in SQL or HTML.
It's unclear what server side programming language you're using, so I can't go further in detail.

Resources