I often see the abbreviation "en-US", whose first part corresponds to the 2-character language codes standardized in ISO 639-1. I also understand that a language tag generally consists of a primary language subtag followed by a series of other subtags separated by hyphens, as explained in https://www.rfc-editor.org/rfc/rfc5646.
That link mentions that there are also 3-letter language codes defined in ISO 639-2, ISO 639-3, and ISO 639-5.
Still, there are more codes defined for Windows/.NET here: http://msdn.microsoft.com/en-us/goglobal/bb896001.aspx. That page refers to the language tags as "culture names" and uses a distinct 3-character code for the "language name". So the "culture name" appears to be based on the 2-character language codes, although I'm not sure why they vary between Windows versions, or how closely they follow the standard language codes. Is "en-US" really a "language code", or is it a "culture name"?
If I'm developing software that uses language codes, which standard should I use? (The 2-character codes or the 3-character codes? If 3-character, then ISO 639-2, -3, or -5?)
Why should I choose one over the other? (For OS platform or programming framework compatibility?)
BCP 47 is the industry best-practice standard for identifying languages, and these language tags are what you should use. BCP 47 dictates that if a language can be identified by either a 2-letter or a 3-letter code, the 2-letter code should be used.
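For illustration, here is a minimal Python sketch of that rule; the mapping is a tiny illustrative excerpt, not a complete ISO 639 table:

```python
# Minimal sketch of the "shortest ISO 639 code wins" rule from BCP 47.
# The mapping below is a tiny illustrative excerpt, not a complete table.
ISO_639_2_TO_1 = {
    "eng": "en",   # English has a 2-letter code, so "en" must be used
    "deu": "de",   # German
    "haw": None,   # Hawaiian has no ISO 639-1 code, so the 3-letter code stays
}

def preferred_language_subtag(iso639_2_code: str) -> str:
    """Return the BCP 47 primary language subtag for a 3-letter ISO 639-2 code."""
    two_letter = ISO_639_2_TO_1.get(iso639_2_code)
    return two_letter if two_letter else iso639_2_code

print(preferred_language_subtag("eng"))  # en
print(preferred_language_subtag("haw"))  # haw
```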
Cultures and locales are distinct from language tags in how they conceive of the region information. The region information in a language tag identifies the origin of the particular dialect (en-US is American English, the variety of English that originated in the United States), whereas the region information in a locale identifies the location where the information is relevant. Since the majority of American English speakers also live in the US, the distinction is not really important when it comes to providing information such as how to spell words or how to format dates and numbers.
Windows is moving away from the concept of a locale or culture to a more expressive notion of language and region (identified separately), which makes it possible to describe situations such as a speaker of American English who resides in England.
Note that there are cases where Windows still uses legacy names that predate this standard, and depending on how you rely on the OS, you may need to map between standards-compliant names and the legacy names.
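As a hedged sketch of what such a mapping might look like (the two legacy names below are ones I recall from .NET; verify them against your target Windows versions):

```python
# Hedged sketch: mapping a couple of known legacy .NET/Windows culture names
# to their BCP 47 equivalents before handing tags to standards-based code.
LEGACY_TO_BCP47 = {
    "zh-CHS": "zh-Hans",  # legacy "Chinese (Simplified)" culture name
    "zh-CHT": "zh-Hant",  # legacy "Chinese (Traditional)" culture name
}

def to_bcp47(culture_name: str) -> str:
    """Map a legacy culture name to a BCP 47 tag; pass standard tags through."""
    return LEGACY_TO_BCP47.get(culture_name, culture_name)

print(to_bcp47("zh-CHS"))  # zh-Hans
print(to_bcp47("en-US"))   # en-US (already standard)
```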
I know there are libraries like bestiejs/punycode.js or the Node.js punycode module to convert punycode, but I can't find any library that detects the language of a punycode string (Greek, Chinese, etc.).
Is it possible to detect the language of punycode natively, or does it require different software to detect the language?
Also, is there any Node.js library that can be used for punycode language detection?
Punycode is an ASCII representation of an otherwise Unicode-based Internationalized Domain Name (IDN). The conversion to punycode is a variable-length encoding, a purely mechanical process that involves additional steps such as case folding and normalization to Unicode Form C. Owing to this mechanical nature, language information is not supposed to be part of the punycode representation at all. It is the Unicode equivalent of the given punycode, falling in a specific Unicode range/block, that gives each character its script (and, by extension, a likely language).
Hence, if one needs language/script detection for an IDN, it needs to be converted to its U-label form first and then passed on to language/script detection routines.
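A minimal Python sketch of that two-step flow, assuming the standard library's IDNA 2003 codec is sufficient (for IDNA 2008 you would use the third-party "idna" package instead); the script guess via Unicode character names is a crude heuristic, not a proper script-property lookup:

```python
import unicodedata

def a_label_to_u_label(domain: str) -> str:
    """Decode an ACE/punycode domain ("xn--...") to its Unicode (U-label) form.

    Uses Python's built-in IDNA 2003 codec; for full IDNA 2008 support,
    the third-party "idna" package would be used instead.
    """
    return domain.encode("ascii").decode("idna")

def guess_scripts(text: str) -> set:
    """Very rough script guess based on each character's Unicode name."""
    scripts = set()
    for ch in text:
        if ch in ".-":
            continue
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0])  # e.g. "LATIN", "CYRILLIC", "CJK"
    return scripts

u_label = a_label_to_u_label("xn--bcher-kva.example")  # "bücher.example"
print(u_label, guess_scripts(u_label))
```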
To learn about the various libraries that can be used in different programming languages for converting punycode to the corresponding Unicode labels, please refer to the following two documents created by the Universal Acceptance Steering Group:
UASG 018A "UA Compliance of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-018a-ua-compliance-of-some-programming-language-libraries-and-frameworks-en/)
UASG 037 "UA-Readiness of Some Programming Language Libraries and Frameworks" (https://uasg.tech/download/uasg-037-ua-readiness-of-some-programming-language-libraries-and-frameworks-en/)
One problem in a project with Domain-Driven Design:
In discussions about the domain model, many terms of the Ubiquitous Language (UL) are used in German by the team members (all German speakers), whereas the English version is used within the analysis model and the code model.
What is good practice for handling this? Should we force ourselves to use the English terms in discussions as well, or is it OK to translate the terms for modeling and implementation?
I've also worked on multiple German DDD projects.
In my experience, the team usually starts to use the English terms automatically as soon as the first discussions and a first implementation have taken place. It helps if you maintain an up-to-date glossary of the German terms and their English counterparts, especially to help new team members.
I think the UL is just another language, like German or English. The point is that it has to be spoken and understood by all team members. And it has to be used everywhere: discussions, documents, diagrams, source code, ...
You cannot use a German term in discussions and its English translation in source code.
Working in Germany, I was once involved in a banking project where the team (all Germans) decided to write the code in English, on the off chance that an offshore team might later be involved.
A couple of months later we were already struggling with our English dictionary. We noticed that we (and the business people) had no problems with the German words, but we ourselves did not always know the correct translations used in the code. Some concepts (legal terms, for example) don't even have a correct translation.
It got even worse. Once some offshore colleagues were involved, they started asking for the German words behind certain things in the code, because they could google the German words but not always the English ones. (That might have been caused by our poor translations into English...)
Having gone through all of that, I would carry whatever the "source" language is into the code. If the source material (requirements, legal framework, business team, etc.) is in German, then German.
The only argument against using German is that it looks and sounds really funny and strange. Oh, and Unicode support is a must. Compare the class name "Überweisungsvorlage" with "CashTransferTemplate".
I know of another team that combined English verbs with German concepts in method names, like "createÜberweisungsvorlage", because German verbs are quirky and do not lend themselves to short and uniform naming conventions.
Summary: You decide, and good luck :)
Use whichever language your domain experts use. If many, ask them to choose the one that defines the domain best.
The best language can be different from one subdomain/Bounded Context to another.
In my application I have Unicode strings and I need to tell which language each string is in.
I want to do this by narrowing down the list of possible languages based on which ranges the characters of the string fall into.
I have the ranges from http://jrgraphix.net/research/unicode_blocks.php
and the possible languages from http://unicode-table.com/en/
The problem is that the algorithm has to detect all languages. Does anyone know of a wider mapping of Unicode ranges to languages?
This is not really possible, for a couple of reasons:
Many languages share the same writing system. Look at English and Dutch, for example. Both use the Basic Latin alphabet. By only looking at the range of code points, you simply cannot distinguish between them.
Some languages use more characters, but there is no guarantee that a specific piece of text contains them. German, for example, uses the Basic Latin alphabet plus "ä", "ö", "ü" and "ß". While these letters are not particularly rare, you can easily create whole sentences without them. So, a short text might not contain them. Thus, again, looking at code points alone is not enough.
Text is not always "pure". An English text may contain French letters because of a French loanword (e.g. "déjà vu"). Or it may contain foreign words, because the text is talking about foreign things (e.g. "Götterdämmerung is an opera by Richard Wagner...", or "The Great Wall of China (万里长城) is..."). Looking at code points alone would be misleading.
To sum up, no, you cannot reliably map code point ranges to languages.
What you could do: Count how often each character appears in the text and heuristically compare with statistics about known languages. Or analyse word structures, e.g. with Markov chains. Or search for the words in dictionaries (taking inflection, composition etc. into account). Or a combination of these.
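A minimal Python sketch of the character-frequency idea; the per-language fingerprints below are rough, truncated approximations for illustration only, and a real system would compute full tables from large corpora:

```python
from collections import Counter
import math

# Illustrative, truncated per-language letter-frequency fingerprints.
FINGERPRINTS = {
    "en": {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "n": 0.067},
    "de": {"e": 0.174, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070},
}

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse frequency vectors."""
    keys = set(a) | set(b)
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def guess_language(text: str) -> str:
    """Pick the fingerprint most similar to the text's letter distribution."""
    letters = [c.lower() for c in text if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values()) or 1
    freqs = {c: n / total for c, n in counts.items()}
    return max(FINGERPRINTS, key=lambda lang: cosine(freqs, FINGERPRINTS[lang]))

print(guess_language("The quick brown fox jumps over the lazy dog"))  # likely "en"
```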
But this is hard and a lot of work. You should rather use an existing solution, such as those recommended by deceze and Esailija.
I like the suggestion of using something like Google Translate, as it will do all the work for you.
You might be able to build a rule-based system that gets you part of the way there. Build heuristic rules for languages and see whether that is sufficient. Certain Tibetan characters do indicate Tibetan, and many languages have unique characters that are a giveaway. But as the other answer pointed out, a limited sample of text may not give you a clear indicator.
Languages do, however, differ in the frequencies with which each character appears, so you could build a basic fingerprint of each language you need to classify and make guesses based on letter frequency. This goes a bit further than a rule-based system. A good tool for building it would be a text classification algorithm, which does the analysis for you: you train the algorithm on samples of different languages instead of articulating the rules yourself.
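A hedged sketch of that idea using character n-grams with scikit-learn; the tiny training set is purely illustrative, and a real system would need large corpora per language:

```python
# Train a character n-gram classifier instead of hand-writing rules.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Purely illustrative training samples; real training data must be much larger.
train_texts = [
    "the cat sat on the mat", "where is the railway station",
    "die Katze sitzt auf der Matte", "wo ist der Bahnhof",
    "le chat est assis sur le tapis", "où est la gare",
]
train_labels = ["en", "en", "de", "de", "fr", "fr"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),  # character n-grams
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

print(model.predict(["der Zug ist spät", "the train is late"]))
```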
A much more sophisticated version of this is presumably what Google does.
I am trying to collect a corpus for a certain language. But when I fetch a web page, how can I determine its language?
Chrome can do it, but what is the principle behind it?
I can come up with some ad-hoc methods, like an educated guess based on the character set, IP address, HTML tags, etc. But is there a more formal method?
I suppose the common method is looking at things like letter frequencies, common letter sequences and words, character sets (as you describe)... there are lots of different ways. An easy one would be to just get a bunch of dictionary files for various languages and test which one gets the most hits from the page, then offer, say, the next three as alternatives.
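A minimal Python sketch of that dictionary-hit approach, with tiny illustrative word lists standing in for real dictionary files:

```python
# Tiny illustrative word lists; real dictionary files would be far larger.
DICTIONARIES = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "nl": {"de", "het", "een", "en", "is", "niet"},
}

def rank_languages(text: str):
    """Rank candidate languages by how many words hit each dictionary."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    hits = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in DICTIONARIES.items()
    }
    # Best guess first, remaining languages offered as alternatives.
    return sorted(hits, key=hits.get, reverse=True)

print(rank_languages("Het is een mooie dag"))  # likely ['nl', ...]
```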
If you are just interested in collecting corpora of different languages, you can look at country specific pages. For example, <website>.es is likely to be in Spanish, and <website>.de is likely to be in German.
Also, Wikipedia is translated into many languages. It is not hard to write a scraper for a particular language.
The model that determines a web page's language in Chrome is called the Compact Language Detector v3 (CLD3), and it is open-source C++ code (sort of; it's not reproducible). There are also official Python bindings for it:
pip install gcld3
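Usage then looks roughly like this (a sketch based on the API names as documented for the gcld3 package; verify against the version you install):

```python
import gcld3

# Build the neural-network language identifier shipped with the bindings.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)

result = detector.FindLanguage(text="This text is written in English")
print(result.language, result.is_reliable, result.probability)
```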
For production of a set of mobile-phone/smartphone minisites, what do you recommend as a technique for automatically choosing the language of the site:
browser IP address
mobile browser language request header
any method related to device specifics or carrier specifics of a certain country?
any other method
The languages that will be targeted are:
Vietnamese
German
Thai
Arabic
Spanish
Indonesian
Italian
Japanese
Chinese, both Traditional and Simplified
Korean
Russian
I understand the answers may vary per language, so feedback on all or any language would be greatly appreciated.
Anything other than what a user has specifically requested is a bad idea. Using geographical IP lookup, for example, is a terrible idea. People may live in a country where multiple languages are spoken, or may simply prefer the lingua franca English, and might find it extremely annoying when another language is forced upon them.
Of the options you mention, only the browser language request header (Accept-Language) sounds like something a user might actually configure themselves. I suspect all the other options will produce an inferior experience for large portions of the target audience.
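For completeness, a minimal Python sketch of honoring that header; the supported-language codes mirror the list in the question, and the parsing is deliberately simplified (it ignores some edge cases of the header syntax):

```python
# Pick a site language from the Accept-Language header, falling back to English.
SUPPORTED = {"vi", "de", "th", "ar", "es", "id", "it", "ja",
             "zh-hans", "zh-hant", "zh", "ko", "ru", "en"}

def pick_language(accept_language: str, default: str = "en") -> str:
    """Return the highest-quality supported language from an Accept-Language value."""
    candidates = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        lang, _, qpart = piece.partition(";")
        try:
            q = float(qpart.split("=")[1]) if qpart else 1.0
        except (IndexError, ValueError):
            q = 0.0
        candidates.append((q, lang.strip().lower()))
    for _, lang in sorted(candidates, reverse=True):
        if lang in SUPPORTED:
            return lang
        primary = lang.split("-")[0]  # e.g. "de-DE" falls back to "de"
        if primary in SUPPORTED:
            return primary
    return default

print(pick_language("de-DE,de;q=0.9,en;q=0.8"))  # de
print(pick_language("fr-FR,fr;q=0.9"))           # en (unsupported -> default)
```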