Brazilian Portuguese website to support Russian, Mandarin and Japanese - web

We have a website in Brazilian Portuguese developed using ColdFusion (for the user interface), Hibernate (for the business logic) and an Oracle database.
If we want to support Russian, Mandarin and Japanese, what concerns should we have?
Thanks in advance.

The main consideration is to make sure everything (and I mean everything: OS, shell, web server, app server, database, editors) is configured to use UTF-8 or Unicode by default.
If you expect a lot of Asian users it is slightly better to use UTF-16 internally, as most Chinese characters fit into a single 16-bit UTF-16 code unit but take 3 or 4 bytes in UTF-8.
With ColdFusion and Oracle this should not present any major problems.
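To quantify the size difference mentioned above, here is a minimal Java sketch (illustrative only, not from the original answer) that prints how many bytes some sample characters take in UTF-8 versus UTF-16:

import java.nio.charset.StandardCharsets;

public class EncodingSizes {
    public static void main(String[] args) {
        String[] samples = { "a", "ã", "中", "日" };
        for (String s : samples) {
            int utf8 = s.getBytes(StandardCharsets.UTF_8).length;
            // UTF_16BE avoids counting the 2-byte byte-order mark that UTF_16 prepends
            int utf16 = s.getBytes(StandardCharsets.UTF_16BE).length;
            System.out.println(s + ": UTF-8 = " + utf8 + " bytes, UTF-16 = " + utf16 + " bytes");
        }
    }
}

Running it shows the Latin characters at 1-2 bytes in UTF-8 and the CJK characters at 3 bytes, while every character here is 2 bytes in UTF-16.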
The other main consideration is how you plan to handle the internationalisation issues.
The standard way is to keep language/culture specific items in a "bundle". There are several tools out there to support this. Basically, you write your app in Portuguese, making sure all text the user will see is in quoted literals, then run the app through a utility which replaces all literals with a library call and extracts all strings into a "bundle" file. You can then edit the bundle to add other language versions of the strings. The great advantage of this is that these formats are standard and translation agencies will have the tools to edit these files, so you can easily outsource the translation to specialists.
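Since the back end already uses Hibernate (i.e. Java), one concrete form of the bundle approach is java.util.ResourceBundle with per-locale .properties files. The bundle name "messages" and the key below are made up for this sketch:

import java.util.Locale;
import java.util.ResourceBundle;

public class Greeting {
    public static void main(String[] args) {
        // Looks up messages_ru.properties, messages_pt_BR.properties, etc. on the classpath,
        // falling back to messages.properties when no locale-specific file exists.
        ResourceBundle bundle = ResourceBundle.getBundle("messages", new Locale("ru"));
        System.out.println(bundle.getString("welcome.title"));
    }
}

Note that classic .properties files are read as ISO-8859-1, so Russian, Chinese or Japanese strings need \uXXXX escapes (newer Java versions read the files as UTF-8).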
The other option, which requires much more work but IMHO produces a nicer result, is to branch off a version of the front end for each language/culture supported. This gets around a lot of problems with text height and string size. It also handles cultural norms better: different cultures have different ordering and conventions for things like addresses and titles.
A classic example of small differences causing big problems is the Irish Republic and post codes: they just don't have them. So if your form validation insists on a zip code, it will annoy your Irish users. The British do have post codes, but these are two 1- to 4-character alphanumeric strings separated by a space, not the more usual 5- or 7-digit numeric codes.
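To make the validation point concrete, here is a hedged sketch (the country codes and patterns are illustrative, not exhaustive) of making the postal-code rule data-driven rather than hard-coding one format:

import java.util.Map;
import java.util.regex.Pattern;

public class PostalCodeValidator {
    // Countries not listed here (e.g. the Irish Republic at the time) simply skip the check.
    private static final Map<String, Pattern> RULES = Map.of(
        "BR", Pattern.compile("\\d{5}-?\\d{3}"),                      // 01310-100
        "US", Pattern.compile("\\d{5}(-\\d{4})?"),                    // 90210 or 90210-1234
        "GB", Pattern.compile("[A-Z]{1,2}\\d[A-Z\\d]? ?\\d[A-Z]{2}")  // SW1A 1AA (simplified)
    );

    public static boolean isValid(String countryCode, String postalCode) {
        Pattern pattern = RULES.get(countryCode);
        if (pattern == null) {
            return true; // no rule known: don't reject the user
        }
        return pattern.matcher(postalCode.trim().toUpperCase()).matches();
    }
}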

Related

Different UTF-8 signature for same diacritics (umlauts) - 2 binary ways to write umlauts

I have quite a big problem that I can't find any help for on the web:
I moved a page of a website from OS X to Linux (both systems running de_DE.UTF-8) and ran into a rather obscure problem:
Some of the files were not found anymore, but they obviously existed on the hard drive with (visibly) the same name. All those files contained German umlauts.
I took one sample image, copied the original request URI from the web page and called it directly - same error. After rewriting the file name it worked. And no, I did not mistype it!
This surprised me, so I took a look into the Apache log, where I found these entries:
192.168.56.10 - - [27/Aug/2012:20:03:21 +0200] "GET /images/Sch%C3%B6ne-Lau-150x150.jpg HTTP/1.1" 304 0 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
192.168.56.10 - - [27/Aug/2012:20:03:57 +0200] "GET /images/Scho%CC%88ne-Lau-150x150.jpg HTTP/1.1" 404 4205 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.1"
That was something for me to investigate... Here's what I found in the UTF-8 chart at http://www.utf8-chartable.de/:
ö c3 b6 LATIN SMALL LETTER O WITH DIAERESIS
¨ cc 88 COMBINING DIAERESIS
I think you've already heard of dead keys: http://en.wikipedia.org/wiki/Dead_key If not, read the article. It's quite interesting ;)
Does that mean that OS X saves all diacritics separately from the letter? Does that really mean that OS X saves the character ö as o and ¨ instead of the single character that results from the combination?
If yes, do you know of a good script that I could use to rename these files? This won't be the first page I move from OS X to Linux...
It's not quite the same thing as dead keys, but it's related. As you've worked out, U+00F6 and U+006F followed by U+0308 have the same visual result.
There are in fact Unicode rules for treating them as the same, based on decompositions. There's a decomposition table in the Unicode character database that tells us that U+00F6 canonically decomposes to U+006F followed by U+0308.
As well as canonical decompositions, there are compatibility decompositions. These lose some information; for example, ² ends up being decomposed to 2. This is clearly a destructive change, but it is useful for searching when you want to be a bit fuzzy (it's how Google knows a search for fiſh should return results about fish).
If there is more than one combining character after a non-combining character, then we can re-order them as long as we don't re-order those of the same combining class. This becomes clear when we consider that it doesn't matter whether we put a cedilla on something and then an acute accent, or the acute first and then the cedilla, but if we put both an acute and an umlaut on a letter it clearly matters which way around they go.
From this, we get four normalisation forms. Put strings into an appropriate normalisation form before doing comparisons, and you won't get tripped up.
NFD: Break everything apart by canonically decomposing it as much as possible. Reorder combining characters in order of their combining class, but keep any with the same class in the same order relative to each other.
NFC: First put everything into NFD. Then look at each combining character in order; if there is no earlier combining character of the same class and there is an equivalent precomposed single character, replace them with it, and re-do the scan looking to compose further.
NFKD: Like NFD, but using compatibility decomposition (damaging change, but useful for comparisons as explained above).
NFKC: Do NFKD, then re-combine (canonical compositions only) as per NFC.
There are also some re-combinations banned from use in NFC so that text that was valid NFC in one version of Unicode doesn't cease to be NFC if Unicode has more characters added to it.
Of NFD and NFC, NFC is clearly the more concise. It's not the most concise representation possible, but it is very concise and can be tested for and/or created in a very efficient streaming manner.
Mac OS X uses NFD for file names. Because they're weirdos. (Okay, there are better arguments than that, they just didn't convince me!)
The Web Character Model uses NFC.* As such, you should use NFC on web stuff as much as possible. There can, though, be security considerations in blindly converting stuff to NFC. But if the text originates with you, it should start out in NFC.
Any programming language that deals with text should have a nice way of normalising text into any of these forms. If yours doesn't, complain (or if yours is open source, contribute!).
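In Java, for example, this lives in java.text.Normalizer; the sketch below just shows that the two spellings of "ö" from the Apache log above compare equal once they are put into the same form:

import java.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String composed   = "Sch\u00F6ne-Lau";  // ö as the single precomposed character U+00F6
        String decomposed = "Scho\u0308ne-Lau"; // o followed by U+0308 COMBINING DIAERESIS

        System.out.println(composed.equals(decomposed));                            // false
        System.out.println(Normalizer.normalize(decomposed, Normalizer.Form.NFC)
                                     .equals(composed));                            // true
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC)); // true
    }
}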
See http://unicode.org/faq/normalization.html for more, or http://unicode.org/reports/tr15/ for the full gory details.
*For extra fun, if you inserted something beginning with a combining long solidus overlay (U+0338) at the start of an XML or HTML element's content, it would turn the > of the tag into ≯, turning well-formed XML into gibberish. For this reason the web character model insists that each entity must itself be NFC and not start with a combining character.
Thanks, Jon Hanna, for a lot of background information here! It was important for getting to the full answer: a way to convert from one normalisation form to the other.
As my changes are in the file system (because of file uploads) that is linked from the database, I now have to update my database dump. The files were already renamed during the move (maybe by the FTP client...).
Command line tools to convert charsets on Linux are:
iconv - converting the content of a stream (maybe a file)
convmv - converting the filenames in a directory
The charset utf-8-mac (as described at http://loopkid.net/articles/2011/03/19/groking-hfs-character-encoding), which I could use with iconv, seems to exist only on OS X systems, so I would have to move my SQL dump to my Mac, convert it and move it back. Another option would be to rename the files back to NFD using convmv, but I think that would hinder more than help in the future.
The tool convmv has a built-in (OS-independent) option to enforce NFC- or NFD-compatible filenames: http://www.j3e.de/linux/convmv/man/
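If you would rather do the rename in a small script than with convmv, a rough Java sketch (assuming the filenames are already valid UTF-8, just decomposed) could walk the tree and rename anything that is not yet NFC:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Normalizer;
import java.util.stream.Stream;

public class NfcRename {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            // Rename the deepest paths first so parent directories stay valid while we work.
            paths.sorted((a, b) -> b.getNameCount() - a.getNameCount())
                 .forEach(NfcRename::renameToNfc);
        }
    }

    private static void renameToNfc(Path path) {
        Path fileName = path.getFileName();
        if (fileName == null) {
            return; // e.g. a filesystem root
        }
        String name = fileName.toString();
        String nfc = Normalizer.normalize(name, Normalizer.Form.NFC);
        if (!nfc.equals(name)) {
            try {
                Files.move(path, path.resolveSibling(nfc));
                System.out.println("renamed: " + path);
            } catch (IOException e) {
                System.err.println("failed: " + path + " (" + e.getMessage() + ")");
            }
        }
    }
}

Applying the same normalisation to the file paths stored in the database dump keeps the database and the file system in sync.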
PHP itself (the language my system, WordPress, is based on) provides a compatibility layer here:
In PHP, how do I deal with the difference in encoded filenames on HFS+ vs. elsewhere? After I have fixed this issue for myself, I will go and write some tests and may also file a bug report with WordPress and the other systems I work with ;)
Linux distros treat filenames as binary strings, meaning no encoding is assumed - though the graphical shell (GNOME, KDE, etc.) might make some assumptions based on environment variables, locale, etc.
OS X, on the other hand, requires or enforces (I forget which) its own variant of UTF-8, with Unicode normalization applied so that all diacritics are expanded into combining characters.
On Linux, when people do use Unicode in filenames, they tend to prefer UTF-8 with precomposed characters for diacritics.

Will an English CAPTCHA be an issue for people in other countries?

What if I have a CAPTCHA that displays a series of English characters? Will people who don't speak English have trouble interpreting and/or typing these characters? If so, what is the best solution for an internationalized CAPTCHA?
Since 99% of URLs are in regular ASCII, I don't think you will have a problem... after all, how would they get to Google or Yahoo if they couldn't type the URL?
That said, I have on occasion run across Chinese characters used in CAPTCHAs.
Image-based CAPTCHA has two main advantages over text-based CAPTCHA:
International
Harder to solve algorithmically (see PWNtcha - captcha decoder)
There are several flavors, such as:
Classification: see Captcha The Dog, KittenAuth, Microsoft Asirra
3D projection: see 3D images: A human way to create Captchas and 3D-based Captchas become reality
Detection: see Image-Based CAPTCHA from Confident Technologies and Pic-Capture
Rotation: see A Dynamic, User-Friendly Captcha With Pictures
Puzzle: see Key Captcha
It would be a problem for users on their native, non-Latin keyboard layouts, for example Russians and Greeks. They would be forced to switch keyboard layouts just to fill in the security question.
Another issue is the ability to even recognize the words - somebody who doesn't speak English could have huge problems getting a word right. Even I sometimes do (for less common words), although I am quite proficient...
In other words, don't make this mistake; your application should be easy to use for all users.
It's definitely a concern. Dictionary-based CAPTCHAs should ideally adapt to the user's language preferences and ask them to recognize words that match their language preferences and by extension the character set they are most familiar with.
But in the absence of such internationalization, I would say that numerals and mathematical expressions are the most universal solution. For word-based CAPTCHAs, a random series of ASCII characters (which, being random, would be culture-neutral) would be the most accessible, as pretty much any user around the world can enter these characters, even if some have to switch their input method.
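As a toy illustration of the numerals idea (purely a sketch; trivial arithmetic like this is of course far weaker against bots than a real CAPTCHA service):

import java.security.SecureRandom;

public class MathChallenge {
    private static final SecureRandom RANDOM = new SecureRandom();

    private final int a = RANDOM.nextInt(9) + 1;
    private final int b = RANDOM.nextInt(9) + 1;

    // Digits and "+" are readable and typeable regardless of language or keyboard layout.
    public String question() {
        return a + " + " + b + " = ?";
    }

    public boolean check(String answer) {
        try {
            return Integer.parseInt(answer.trim()) == a + b;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}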
Now where it really gets tricky is providing accessibility alternatives for visually impaired users. Making a universal audio CAPTCHA seems pretty much impossible (you could consider a set of universally recognized sounds instead of spoken words, but I doubt this would provide sufficient security). And internationalized (multilingual) spoken-word generation is far from trivial.
No, because English CAPTCHAs are ASCII - and ASCII is always available, even if people have a Japanese, Chinese, or Russian keyboard. So this should not be a problem! And image-based CAPTCHAs only require the person to read the letters, which should be possible for anybody on the web who can see, as SQLMenace pointed out.
The other way around is a problem though.
Google's reCAPTCHA has a little icon that lets the user get a different CAPTCHA if for some reason it is not readable or contains foreign characters.
I would recommend that you use Google's reCAPTCHA rather than implementing one yourself.
Added Benefit:
Google's reCAPTCHA is also available in other languages, by the way (http://www.google.com/recaptcha/faq), which makes it possible for you to internationalize the CAPTCHA for the user's default locale.
EDIT:
There is a workaround to get Google's reCAPTCHA to work with Flash!
Check here:
http://groups.google.com/group/recaptcha/browse_thread/thread/e22d7e3c91bcc9db
Sure they are a problem. Would a Russian CAPTCHA be a problem for you? What about a Chinese one?
The URLs are indeed ASCII, but that is only relevant for geeks.
Regular people go to Google, type some text in their own language, and then click on one of the results. They never get to type a URL.
Yes, this could be a problem for a small percentage of users. Is it a large enough problem to take into consideration when building the UI for your site to improve the UX? That's up to you. If it were up to me, probably not.
To point you in the right direction, though, I would use Google's reCAPTCHA. It serves a great cause and works like a charm. There's also a great API where you can customize the language that it displays. You could use PHP to detect the user's country and write some code to change the settings so it displays in their native language.
Here's a sample of changing reCAPTCHA's language; "fr" is French:
<script type="text/javascript">
  var RecaptchaOptions = {
    lang : 'fr'
  };
</script>
Google reCAPTCHA's API:
http://code.google.com/apis/recaptcha/docs/customization.html#i18n
I believe that the 26 letters of the English alphabet are recognized by users in most (90% or more) of the world. We have Chinese, Japanese, Cyrillic and Arabic users, but all of them have the option of switching to an English keyboard layout within their operating systems.
We have no diacritics in English, which makes everything a lot easier and our system more easily adaptable all over the world. Everyone can type ASCII, and they are still able to switch to their own region-specific/language-specific characters.

How would one store German text in an embedded system?

I've created a memory-mapped 1-bit interface to an LCD in an embedded system, along with four or five bit-mapped fonts for the 90+ printable ASCII characters. Writing to the screen is as simple as using an echo-like statement (it's embedded Linux).
Other than something strictly proprietary, what recommendations can people make for storing German (or Spanish, or French for that matter) text? Unicode seems to be a pretty heavy hitter.
If I understand you right, you are looking for a lightweight encoding for German characters? In Europe, you would normally use Latin-1 or, better, ISO 8859-15. These are 8-bit extensions of ASCII containing most of the characters used by Western European languages.
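As a quick hedged check (the sample string is made up; ISO-8859-15 ships with standard JREs even though only a handful of charsets are strictly mandated), this shows how compactly German text fits into a single-byte encoding compared with UTF-8:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GermanBytes {
    public static void main(String[] args) {
        String text = "Größe in µm, Preis in €";
        byte[] latin9 = text.getBytes(Charset.forName("ISO-8859-15")); // one byte per character here
        byte[] utf8   = text.getBytes(StandardCharsets.UTF_8);         // ö, ß, µ take 2 bytes; € takes 3

        System.out.println("ISO-8859-15: " + latin9.length + " bytes");
        System.out.println("UTF-8:       " + utf8.length + " bytes");
    }
}

The trade-off is that an 8-bit code page only covers Western European languages; the moment Russian or CJK text appears you are back to Unicode.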
Well, UTF-8 isn't that big. I recommend it if you want to be able to use one or more languages for which you won't find a matching character in Latin-1.

Testing for Japanese/Chinese Characters in a string

I have a program that reads a bunch of text and analyzes it. The text may be in any language, but I need to detect Japanese and Chinese specifically so I can analyze them in a different way.
I have read that I can test each character's Unicode code point to find out whether it is in the range of CJK characters. This is helpful; however, I would like to separate the two, if possible, in order to process the text against different dictionaries. Is there a way to test whether a character is Japanese or Chinese?
You won't be able to test a single character and tell with certainty that it is Japanese or Chinese, because of the way the Unihan code points are implemented in the Unicode standard. Basically, every Chinese character is a potential Japanese character. However, the reverse is not true. Also, there are a number of conventions that can be used to test whether a block of text is in one language or the other.
Simplifications - if the character you are testing is a PRC simplification, such as 门, it is only used in mainland (Simplified) Chinese.
Kana - if the character is one of the many Japanese kana characters, such as あいうえお, then the block of text you are working with is definitely Japanese.
The problem arises with the sheer number of characters and words that the two languages have in common. However, if I needed a quick and dirty solution to this problem, I would check entire blocks of text for kana - if the text contains kana, then I know it is Japanese. If you need to distinguish Korean as well, I would test for Hangul. And if you need to distinguish which type of Chinese, testing for the types of simplification would be the best approach.
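A rough Java sketch of that quick-and-dirty check, using the standard Character.UnicodeBlock lookups (how you treat mixed or empty text is up to you):

public class ScriptSniffer {
    public static String guessLanguage(String text) {
        boolean sawHan = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block == Character.UnicodeBlock.HIRAGANA
                    || block == Character.UnicodeBlock.KATAKANA) {
                return "Japanese";                        // kana appears only in Japanese text
            }
            if (block == Character.UnicodeBlock.HANGUL_SYLLABLES
                    || block == Character.UnicodeBlock.HANGUL_JAMO) {
                return "Korean";
            }
            if (block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS) {
                sawHan = true;                            // shared Han: keep scanning for kana
            }
            i += Character.charCount(cp);
        }
        return sawHan ? "Chinese (probably)" : "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guessLanguage("日本語のテキストです")); // Japanese (kana present)
        System.out.println(guessLanguage("简体中文文本"));         // Chinese (probably)
    }
}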
The process of developing Unicode included Han Unification. This is because a lot of Japanese characters are derived from, or the same as, Chinese characters; similarly with Korean. There are some characters (katakana and hiragana - see chapter 12 of the Unicode Standard, version 5.1.0) commonly used in Japanese that would indicate that the text is Japanese rather than Chinese, but I believe it would be a statistical test rather than a definitive one.
Check out the O'Reilly book on CJKV Information Processing (CJKV is short for Chinese, Japanese, Korean, Vietnamese; I have the CJK predecessor lurking somewhere). There's also the O'Reilly book Unicode Explained, which may be of some help, though probably not for this question (I don't recall a discussion of how to identify Japanese and Chinese text).
You probably can't do that reliably. Japanese uses a lot of the same characters as Chinese. I think the best you could do is to look at a block of text. If you see any uniquely Japanese characters, then you can assume the whole block is Japanese. If not, then it's probably Chinese.
However, I'm just learning Chinese, so I'm not an expert.
Testing for characters in the katakana or hiragana ranges should be a very reliable means of determining whether or not the text is Japanese, especially if you are dealing with 'regular' user-generated text. If you are looking at legal documents or other more official fare, it might be slightly more difficult, as there will be a much greater preponderance of complex Chinese characters - but it should still be pretty reliable.
A workaround is to check the encoding before it is converted to Unicode.
There are many characters which are only (commonly) used in Japanese or only used in Chinese.
Japan and China both simplified many characters but often in different ways. You can check for Japanese Shinjitai and Simplified Chinese characters. There are many more of the latter than the former. If there are none of either then you probably have Traditional Chinese.
Of course, if you're dealing with Unicode text, you may find occasional rare characters or mixed languages, which could throw off a heuristic, so you're better off counting the types of characters to make a judgement.
A good way to find out which characters are common in one language and not in the others is to compare the legacy encodings against each other. You can find mappings of each to Unicode easily on the internet.
I used to have some code I wrote which did a binary search by codepoint and it was extremely fast even in JavaScript - I may have lost it in my travels though (-:

JavaME internationalization (i18n)

Does anyone have experience with internationalization in Java ME? I'm looking for as much information as possible: examples, experiences and maybe some best practices.
Thanks
A few thoughts. J2ME doesn't support i18n well, as the API support is not there (you can't use resource bundles). But we can do this to a limited extent. Here is what I found out.
It is difficult, if not impossible, to support English and, say, Chinese (with its typographic characters) in a given J2ME app, but it is easier to support English and, say, Spanish (I forget the correct nomenclature for talking about i18n support, but you get the idea).
We can keep all strings in one config class; that way you can swap that class out for different languages (a sketch of this idea follows below).
We can have the text/strings downloaded from the server on the initial launch of the app, and thus have the ability to swap them out from the server.
Because of the different screen sizes, it is best to work with custom fonts, so that code can be written to calculate the text length while displaying it. This will make supporting multiple languages easier.
Image assets can also be downloaded from the server based on the language. But I don't think we can change the MIDlet icon, so it should be generic.
With this in mind, it is possible to design multiple-language support.
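A bare-bones sketch of the strings-in-one-class idea mentioned above (written as plain Java for readability; on a real MIDP/CLDC device you would use the non-generic Hashtable, and the keys and translations here are invented for the example):

import java.util.Hashtable;

// One instance per language; which language is used is decided (or downloaded) at startup.
public class I18nStrings {
    private final Hashtable<String, String> strings = new Hashtable<>();

    public I18nStrings(String language) {
        if ("pt".equals(language)) {
            strings.put("menu.start", "Iniciar");
            strings.put("menu.exit", "Sair");
        } else { // default to English
            strings.put("menu.start", "Start");
            strings.put("menu.exit", "Exit");
        }
    }

    public String get(String key) {
        String value = strings.get(key);
        return value != null ? value : key; // fall back to the key so missing strings are visible
    }
}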
omemuhammed's answer is an excellent coverage of localization in the mobile space.
I've only had to support EFIGS (English, French, Italian, German, and Spanish). We stored all strings in XML and had an XML pack for each language. We would then compile these XML packs into proprietary binary data, and we could either build all 5 languages into one build, include only one language when the application size was tight, or download the binary from a server.
Another consideration with localization is screen layout. I also recommend custom fonts in order to have better control of the display across many different devices. You will need some auto-wrapping code to be able to adjust to different screen resolutions and aspect ratios, and you will need a way to handle strings that run off the screen on some devices. Either paging or scrolling would be a good solution.
Finally, just know that German will screw up your formatting. Try to allow 20-30% padding in English for menus and other UI elements, as the German translations will be much longer than those in the other languages.
See the actual internationalization spec for JavaME: http://www.jcp.org/en/jsr/detail?id=238
Recent Symbian phones should support this.
One obvious piece of advice is to actually try your application on a localized phone: get a phone from Switzerland (it should support at least 4 languages) and another from Hong Kong (with 3 different variants of Chinese). It might be worth looking into Eastern Europe and the ex-USSR too.
When the actual characters aren't your usual ASCII ones, you do need to use a TextBox or TextField in order to get the localized native control on the screen.
Keep in mind that when you use RTL (right-to-left) languages, like Arabic, you should mirror the positions of almost everything on the screen. For example, a list would look like this in Latin-script languages:
List item 1
List item 2
List item 3
but the bullets would be on the other side of the screen in Arabic (I tried to show it here, but I couldn't generate an inverted list :P).
One other thing: it is better to store your strings in a class than in a plain-text properties file, as the latter may cause errors in interpreting the Unicode of some languages, depending on the OS and text editor you are using.
What I usually do is have an I18nManager class that stores the language in startApp, and through this class I get all the strings I need.
