Localisation practices, how to handle difference between language and region


I'm developing a website that will require different content for different regions.
Different regions also have a preferred language (defaulting to English), but may have multiple.
For example, Taiwan and Hong Kong pages have different content, despite having the same preferred language (Traditional Chinese). Each region may have targeted content, but a lot of the content would overlap with each other and other regions. Furthermore, Hong Kong would also want Hong Kong content to be able to display in English.
As someone new to localisation, I'd like to know: do existing l10n libraries typically handle these kinds of cases and demarcate between region and language? Would you have to copy the language-specific content multiple times for each region? Or would you just create one language file (or however the language strings are stored) for each language, and have different regions pull relevant content from the same language files depending on what that region needs?
I am likely going to be using a CMS (currently thinking Silverstripe), but I haven't decided yet as I am still figuring out the requirements (including localisation).
Thanks!

It's an interesting, but very broad question.
I would draw a distinction here between content (different pages, sections, menus..) and translations (different versions of the content).
In the sense that i18n libraries display translations within a common template - the locale (language+region) tends to be one and the same thing. Taking your example, you described three locales: zh-HK, en-HK and zh-TW. (Possibly en-TW too, although that wasn't clear).
The question is whether these locales are merely translations - Is it enough to simply create three versions of the same bits? If so, a common approach to the overlap factor is fallback. i.e. You might allow your zh-TW locale to fall back to the zh-HK in cases where they can be identical.
If that approach works, then check whether your chosen CMS supports per-language fallback (as opposed to a single global fallback to English, as is common).
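To make the per-language fallback idea concrete, here is a minimal sketch in Python. It is not how any particular CMS does it: the `FALLBACKS` map, the `CONTENT` store and the `lookup` function are all hypothetical names, and the sample strings are made up for illustration.

```python
from typing import Optional

# Hypothetical per-locale fallback map: each locale names the locale to try next.
FALLBACKS = {
    "zh-TW": "zh-HK",  # Taiwan reuses Hong Kong text where the two can be identical
    "zh-HK": "en",
    "en-HK": "en",
}

# Hypothetical content store: only the strings that differ need to exist per locale.
CONTENT = {
    "en": {"welcome": "Welcome", "contact": "Contact us"},
    "en-HK": {"welcome": "Welcome to our Hong Kong site"},
    "zh-HK": {"welcome": "歡迎", "contact": "聯絡我們"},
    "zh-TW": {"welcome": "歡迎光臨"},
}


def lookup(key: str, locale: str) -> Optional[str]:
    """Walk the fallback chain for this locale until some locale has the key."""
    seen = set()
    current: Optional[str] = locale
    while current is not None and current not in seen:
        seen.add(current)  # guards against a misconfigured fallback cycle
        value = CONTENT.get(current, {}).get(key)
        if value is not None:
            return value
        current = FALLBACKS.get(current)
    return None
```

With this layout, `lookup("contact", "zh-TW")` finds nothing in zh-TW and picks up the zh-HK string, while `lookup("welcome", "zh-TW")` returns the Taiwan-specific text.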
However, if the content differs wildly between regions (totally different pages, menus etc..) then I would say it's typical to run separate instances of your CMS. hk.example.com and tw.example.com both available in whatever language translations you see fit. This will probably prevent sharing of content when it overlaps. That's the case in every CMS I've worked with but perhaps someone else can tell you differently.

Related

Getting local language code from lat lon coordinates

I would like to use the current latlon centre of a Leaflet map to detect the local language so that I can choose which Wikipedia version to query against. Currently I am just querying the English version but that limits the results. For example in Tallinn ...
English - 80 results - https://en.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=1000&gslimit=500&gscoord=59.436682840436205%7C24.747337102890015
Estonian - 109 results - https://et.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=1000&gslimit=500&gscoord=59.436682840436205%7C24.747337102890015
Also, the place names are returned in the local language, which is a good thing for me.
I appreciate that sometimes there may be ambiguity, and other times maybe I'll get a language that there isn't a Wikipedia for (list at http://meta.wikimedia.org/wiki/List_of_Wikipedias#All_Wikipedias_ordered_by_number_of_articles), but I can code for that and default to English as a fallback.
By way of background:
I have a prototype app at http://postcodepast.com which aims to allow discovery of cultural heritage content local to where people search (their home town for example). It allows them to confirm locations (i.e. to geotag them), possibly improve the accuracy, or simply reject it when there are false matches. All the crowdsourced data generated from this will be released as open data (currently a geoJSON feed).
The way it does this is to look up local landmarks from Wikipedia and then search for these names across cultural heritage providers like Europeana, DPLA and Flickr Commons. I could equally use other sources of landmarks (e.g. a list of geotagged WW1 battlefields) or other providers (with wider geographical coverage, or maybe specific content). Hopefully you get the idea.
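The "default to English as a fallback" part of the question can be sketched as below, in Python. This does not solve the lat/lon to language step itself; it assumes a candidate language code has already been obtained somehow. `KNOWN_WIKIPEDIAS` is a hypothetical hand-picked subset of the List of Wikipedias, and the function names are made up. The query parameters are the ones from the example URLs above.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical subset of language editions; a real list would come from the
# List of Wikipedias page linked in the question.
KNOWN_WIKIPEDIAS = {"en", "et", "fr", "de", "ru"}


def pick_wikipedia_language(candidate: Optional[str]) -> str:
    """Use the candidate code if that edition exists, otherwise default to English."""
    if candidate and candidate.lower() in KNOWN_WIKIPEDIAS:
        return candidate.lower()
    return "en"


def geosearch_url(lat: float, lon: float, candidate: Optional[str]) -> str:
    """Build a geosearch URL like the ones in the question for the chosen edition."""
    language = pick_wikipedia_language(candidate)
    params = {
        "action": "query",
        "list": "geosearch",
        "gsradius": 1000,
        "gslimit": 500,
        "gscoord": f"{lat}|{lon}",  # urlencode turns the pipe into %7C
    }
    return f"https://{language}.wikipedia.org/w/api.php?{urlencode(params)}"
```

For example, `geosearch_url(59.4, 24.7, "et")` targets et.wikipedia.org, and an unknown or missing code falls back to en.wikipedia.org.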

Multi Language Approaches with ExpressionEngine

So it's been two years since I last built a multi language site and I'm starting a new one straight away.
The last site I built used the biber multi language module, which seems to have had a name change since then and is now called Multi Language Support. With the last build I used matrix and a drop-down to set the language for each row, thus ensuring that each page was all in one entry and that it would be easy to add additional languages just by adding an additional row to the matrix field.
My question is twofold. First, are there other approaches to consider at this point, since it's been two years since I last built a multi language site?
Secondly, with the last site all the SEO (meta description, meta keywords and title tags) was in English only. The new site will need all of this in both English and French. I typically use the SEO Lite module to handle SEO, but I don't see a way to have multiple languages available with this. Is there a simple way to set this up or will I need to go with matrix fields as mentioned above?
Am open to any and all approaches that are available for me to evaluate.
** UPDATE **
Will not be using Structure so that does not need to be a factor.

I don't have a long-winded answer, but Publisher from Boldminded looks quite nice and full of features. And it's on sale.
Brian is a great dev with great support. Based on past experiences with support I highly recommend taking a look.
http://boldminded.com/add-ons/publisher

I have been using a few different approaches:
Transcribe is what I have been using recently. It works very well with template rewriting and language variables. It's pretty easy to set up, but is also sometimes a bit too powerful. It's supposed to work well with Structure, but I have not yet really investigated it. One of the benefits is that you are not required to have all pages in all languages.
Structure
With structure and Low Variables you can easily set up a tree like:
home
/about us
/products
/nl/home
/nl/over ons
/nl/producten
You can reuse your field groups and not all content has to exist in the different languages

I would say it would depend on the size of the website. For a small site, building out two columns in matrix would be perfect, like you have done. I have done this with a Mobile that simply had a column for English and for Spanish for all content elements as well as meta tags too. Then I simply passed a variable based on what language the visitor initially selected.
But I would say going with Multi Site Manager is probably your best bet. It's true that you will need to purchase some additional addons if you're using them on your 'default' site, but it's very easy to manage. Plus won't you want your URL structures to be different? For SEO purposes you'd really want to have spanish/french or whatever language to have its own unique URL.
I typically will use MSM and the NSM Better Meta addon. I would say this offers the most flexibility no matter the size of the site.

My approach on a couple of sites, regardless of size, has been:
Low Variables for in-template language variables
strategic planning of template group and template names (since it's the one thing that's a bit more difficult to handle translation on with anything other than htaccess)
a "languages" channel for extensibility
each channel has a language field which uses a radio button drawn from the languages channel so the language is assigned to the entry (and then used in the entries loop as a limiter)
Playa to relate the entry "equivalents" together with one another (i.e. allow a one to one language switch WITHOUT all the languages being in the same entry) - a relationship field is technically the better fix, but Playa has searchability where relationship fields do not
SEO handled with custom fields in the channels, with conditional fall back to a default set in Low Variables
This approach is what I preferred (I have used Transcribe but I did find it a bit resource intensive in terms of queries, where this approach is a bit lighter, especially with caching). It allows me to maintain consistent enforcement and validation of required fields, for example, and allows for a variety of other benefits: language-specific entry url titles (without having to force the content editor to create them manually), asynchronous content, asynchronous content translation (which always seems to be an issue: English is available before French is, awaiting translation, for example), potentially separate workflows, etc.
You can see this approach in action at www.cps.ca

I've done a number of multi-language sites with EE, albeit never with the Bieber module. My preference is to use Republic Variables for creating a variable matrix for labels (then it's just a simple flexible tag on the page). There's a bit of set-up that needs to be done, but once you've done it a couple of times it's only 5 minutes work:
A basic overview of steps (I've begun documenting them on my very old EE site):
1) Use the .htaccess method to remove index.php from URLs (making them clean) and in EE, set the system to use the title as the article link
2) Create ANSI directories for each language and move copies of index.php and .htaccess where
the system path is corrected:
$system_path = '../system';
and in the language directory, create a .htaccess to relaunch queries with the current language:
RewriteEngine on
RewriteCond $1 !^(index\.php) [NC]
RewriteRule ^(.*)$ /ru/index.php/$1 [L]
(check my site for more detailed directions)
3) Install the wonderful Republic Variables and set for the number of languages you need.
(_en for English, _ru for Russian, _es for Spanish, etc..) and make the same one default as the default language in your index.php. Under configuration, I prefer to set for a Language Postfix. Add a variable "teaser" for testing, and fill in all languages.
4) On the page, drop in a tag with this format: {variable_{language}}, e.g.
{teaser_{language}}
and you should see the default language variable. Insert the language in the URL before the template/page (e.g., www.sitename.com/ru/directory/template) and the language will switch on the fly. I'll be documenting this in a follow-up post this weekend.

I think MSM could also be considered as an option for Multilingual sites, because of the following factors:
For content rich sites, MSM is really easier for content contributors, who are usually responsible for a single language. The one language == one site equation is an easy one to understand.
Multilingual websites tend to cater for various audiences. Those audiences' needs usually evolve over the 3 or 4 years your site will last. MSM offers you the possibility to add sections or functionalities for a specific language down the line, which is a lot more difficult with a more entangled data structure.
As a developer, an MSM setup offers you a lot of flexibility. Adding or deleting languages is as easy as copying or deleting an MSM site. SEO tends to be easier since you can translate your template / template groups / url_titles quickly. If your template structure is clean and if you use add-ons like Low Variables or even global variables to avoid having content in your templates, maintaining various sets of templates is really easy. It can also be a bonus if one day you need to deal with right to left languages or something like that.
Here is an example of a site I maintain

I also want to point out the Multi-Language Episodes on the EE Podcast:
http://ee-podcast.com/episodes/tag/multi-lingual
Ep 49 actually has Transcribe's Tom Jaeger chatting about multi-language process and gotchas.
Ep 54 has Nicolas Bottari chat about his processes

Where can I find a good introduction to locales

I have to write some code working with locales. Is there a good introduction to the subject to get me started?

First posted at Everything you need to know about Locales
A long time ago when I was a senior developer in the Windows group at Microsoft, I was sent to the Far East to help get the F.E. version of Windows 3.1 shipped. That was my introduction to localizing software – basically being pushed in to the deep end of the pool and told to learn how to swim. This is where I learned that localization is a lot more than translation.
Note: One interesting thing we hit - the infamous Blue Screen of Death switched the screen into text mode. You can't display Asian languages in text mode. So we (and by we I mean me) came up with a system where we put the screen in VGA mode, stored the 12 pt. courier bitmap at the resolution for just the characters used in BSoD messages, and rendered it that way. You kids today have it so easy :).
So keep in mind that taking locale into account can lead to some very unexpected work.
The Locale
Ok, so forward to today. What is a locale and what do you need to know? A locale is fundamentally the language and country a program is running under. (There can also be a variant added to the country but use of this is extremely rare.) The locale is this combination but you can have any combination of these two parts. For example a Spanish national in Germany would set es_DE so that their user interface is in Spanish (es) but their country settings are those of Germany (DE). Do not assume location based on language or vice-versa.
The language part of the locale is very simple - that's what language you want to display the text in your app in. If the user is a Spanish speaker, you want to display all text in Spanish. But what dialect of Spanish - it is quite different between Spain and Mexico (just as in America we spell color while in England it's colour). So the country can impact the language used, depending on the combination.
All languages that support locale specific resources (which is pretty much all of them today) use a fall-back system. They will first look for a resource for the language_country combination. While es_DE has probably never been done, there often is an es_MX and es_ES. So for a locale set to es_MX it will first look for the es_MX resource. If that is not found, it then looks for the es resource. This is the resource for that language, but not specific to any country. Generally this is copied from the largest country (economically) for that language. If that is not found, it then goes to the "general" resource which is almost always the native language the program was written in.
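Here is a rough sketch of that lookup order in Python. It is only an illustration of the language_country, then language, then general sequence described above, not the implementation any particular run-time uses. The names `candidate_resources`, `find_resource` and the `RESOURCES` table (including the "default" key for the general resource) are all made up.

```python
from typing import List, Optional


def candidate_resources(locale: str) -> List[str]:
    """Return resource names to try, most specific first."""
    language, _, country = locale.partition("_")
    chain = []
    if country:
        chain.append(f"{language}_{country}")  # e.g. es_MX
    chain.append(language)                     # e.g. es
    chain.append("default")                    # the "general" resource
    return chain


# Hypothetical resources: es_MX overrides only the string that differs from es.
RESOURCES = {
    "default": {"computer": "computer", "save": "Save"},
    "es": {"computer": "ordenador"},
    "es_MX": {"computer": "computadora"},
}


def find_resource(key: str, locale: str) -> Optional[str]:
    """Return the first match along the fallback chain, or None if nothing has the key."""
    for name in candidate_resources(locale):
        value = RESOURCES.get(name, {}).get(key)
        if value is not None:
            return value
    return None
```

Note what happens with `find_resource("save", "es_MX")`: neither Spanish resource has the string, so the user sees the untranslated "Save" from the general resource.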
The theory behind this fallback is you only have to define different resources for the more specific resources - and that is very useful. But even more importantly, when new parts of the UI are made and you want to ship beta copies or you release before you can get everything translated, well then the translated parts are localized but the untranslated parts still display - but in English. This annoys the snot out of users in other countries, but it does get them the program sooner. (Note: We use Sisulizer for translating our resources - good product.)
The second half is the country. This is used primarily for number and date/time settings. This spans the gamut from what the decimal and thousand separator symbols are (12,345.67 in the U.S. is 12 345,67 in Russia) to what calendar is in use. The way to handle this is by using the run-time classes available for all operations on these elements when interacting with a user. Classes exist for both parsing user entered values as well as displaying them.
Keep a clear distinction between values the user enters or are displayed to the user and values stored internally as data. A number is a string in an XML file but in the XML file it will be "12345.67" (unless someone did something very stupid). Keep your data strongly typed and only do the locale specific conversions when displaying or parsing text to/from the user. Storing data in a locale specific format will bite you in the ass sooner or later.
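A small Python sketch of that separation follows. The `SEPARATORS` table is a hypothetical stand-in for the run-time's locale data (real code should use the run-time classes, as said above); it only exists so the example is self-contained. The separator values follow the U.S. and Russian examples given earlier.

```python
from decimal import Decimal

# Hypothetical stand-in for real locale data: (thousands separator, decimal separator).
SEPARATORS = {
    "en_US": (",", "."),
    "ru_RU": (" ", ","),
}


def format_for_user(value: Decimal, locale_code: str) -> str:
    """Locale-specific text, produced only at the moment of display."""
    group, decimal_point = SEPARATORS[locale_code]
    neutral = f"{value:,.2f}"  # neutral form, e.g. 12,345.67
    # Swap via a placeholder so the two separators do not clobber each other.
    return (neutral.replace(",", "\x00")
                   .replace(".", decimal_point)
                   .replace("\x00", group))


def parse_from_user(text: str, locale_code: str) -> Decimal:
    """Turn what the user typed back into a strongly typed value."""
    group, decimal_point = SEPARATORS[locale_code]
    return Decimal(text.replace(group, "").replace(decimal_point, "."))


def to_storage(value: Decimal) -> str:
    """Locale-independent form for files such as XML."""
    return str(value)
```

The value stays a `Decimal` everywhere inside the program; only `format_for_user` and `parse_from_user` know about locales, and `to_storage` always writes "12345.67" regardless of who is logged in.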
Chinese
Chinese does not have an alphabet but instead has a set of glyphs. The People's Republic of China several decades ago significantly revised how to draw the glyphs and this is called simplified. The Chinese glyphs used elsewhere continued with the original and that is called traditional. It is the exact same set of characters, but they are drawn differently. It is akin to our having both a text A and a script A - they both mean the same thing but are drawn quite differently.
This is more of a font issue than a translation issue, except that wording and usage has diverged a bit, in part due to the differences in approach between traditional and simplified Chinese. The end result is that you generally do want to have two Chinese language resources, one zh_CN (PRC) and one zh_TW (Taiwan). As to which should be the zh resource - that is a major geopolitical question and you're on your own (but keep in mind PRC has nukes - and you don't).
Strings with substituted values
So you need to display the message Display ("The operation had the error: " + msg); No, no, no! Because in another language the proper usage could be Display("The error: " + msg + " was caused by the operation"); Every modern run-time library has a construct where you can have a string resource "The operation had the error: {0}" and will then substitute in your msg at {0}. (Some use a syntax other than {0}, {1}, …)
You store these strings in a resource file that can be localized. Then when you need to display the message, you load it from the resources, substitute in the variables, and display it. The combination of this, plus the number & date/time formatters make it easy to build up these strings. And once you get used to them, you'll find it easier than the old approach. (If you are using Visual Studio - download and install ResourceRefactoringTool to make this trivial.)
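As a quick illustration in Python, whose `str.format` happens to use the same {0} syntax: the `MESSAGES` table below is a hypothetical in-memory stand-in for a resource file, and "xx" is a made-up locale code standing for a language that needs the other word order.

```python
# Hypothetical stand-in for a localizable resource file.
MESSAGES = {
    "en": {"op_error": "The operation had the error: {0}"},
    "xx": {"op_error": "The error: {0} was caused by the operation"},
}


def render(locale: str, key: str, *args: object) -> str:
    """Load the template for the locale (English if missing) and substitute the values."""
    templates = MESSAGES.get(locale, MESSAGES["en"])
    template = templates.get(key, MESSAGES["en"][key])
    return template.format(*args)
```

The calling code is `render(locale, "op_error", msg)` in every language; where the value lands in the sentence is decided by the translator, not by string concatenation in the program.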
Arabic, Hebrew, and complex scripts.
Arabic & Hebrew are called bi-directional because parts of the text are right to left while other parts are left to right. The text in Arabic/Hebrew is written and read right to left. But when you get to Latin text or numbers, you then jump to the left-most part and read that left to right, then jump back to where that started and read right to left again. And then there is punctuation and other non-letter characters where the rules depend on where they are used.
Here's the bottom line - it is incredibly complex and there is no way you are going to learn how it works unless you take this on as a full-time job. But not to worry, again the run-time libraries for most languages have classes to handle this. The key to this is the text for a line is stored in the order you read the characters. So in the computer memory it is in left to right order for the order you would read (not display) the characters. In this way everything works normally except when you display the text and determine moving the caret.
Complex scripts like Indic scripts have a different problem. While they are read left to right, you can have cases where some combinations of letters are placed one above the other, so the string is no wider on the screen when the second letter is added. This tends to require a bit of care with caret movement but nothing more.
We even have cases like this in English where ae is sometimes rendered as a single æ character. (When the human race invented languages, they were not thinking computer friendly.)
Don't Over-Stress it
It seems like a lot but it's actually quite simple. In most cases you need to display text based on the closest resource you have. And you use the number & date/time classes for all locales, including your native one. No matter where you live, most computer users are in another country speaking another language - so localizing well significantly increases your potential market.
And if you're a small company, consider offering a free copy for people who translate your product. When I created Page 2 Stage I offered a free copy (list price $79.95) for translating it - and got 28 translations. I also met some very nice people online in the process. For an enterprise level product, many times a VAR in another country will translate it for you at a reduced rate or even free if they see a good market potential. But in these cases, do the first translation in-house to get the kinks worked out.
One resource I find very useful is the Microsoft Language Portal where you can put in text in English and if that text is in any of the Microsoft products, it will give you the translation Microsoft used for a given language. This can give you a fast high-quality translation for up to 80% of your program in many cases.
Удачи! (Good Luck)

Any flexible CMS perfect for restaurant website’s back-end?

I’m building a website for a restaurant which consists of several static pages like ‘About us’ and editable menu.
I need a CMS flexible enough to be able to add items individually (by individually, I mean adding items doesn’t equal pasting a HTML list of n products into another static page).
Each item should contain its name, description, price and category. The list of added items should be displayed using templates the way I want them to.
Can you suggest any lightweight CMS which can provide similar conditions?

There are tons of options for simple page creation. Have you considered just using one of the many free website builders out there? Then you don't even have to worry about finding hosting, just make it happen quickly and easily with one of them. For instance, take a look at Weebly (review here) or Wix. Both allow for free pages and both are incredibly easy to use. Squarespace (review here) is another solid option (and one of my favorites) but charges a small fee (which I personally think is worth it).
Weebly allows for some slick drag and drop of page elements into place as does Wix. They are what I would classify as the easiest of the batch while Squarespace provides for an excellent user interface experience.
Other options, if you'd prefer something hosted on your own, would depend on your experience level. I am a huge fan of Processwire, and ImpressPages has come along nicely and is a great little CMS too.
These are exceptions to the typical Top Three that everyone tends to recommend I know but I like to spread the word about other projects instead of the usual ones.
Cheers!
Mike

Sounds like a job for Wordpress 3.0 plus Custom Post Types UI + Verve Meta Boxes plugins. Wordpress will handle the static pages, the other two plugins will allow you to make a Menu Item post type with custom fields.

It is not exactly lightweight, but you could do it with Drupal. You can define your own content type "product", use the CCK module to add your fields (price, ...) and use the Views module to display it how you want.
Drupal has a relatively steep learning curve, so it may be overkill for this project. It is definitely flexible enough for this, though.

Preferable Tag Cloud Visualization Formats

Out of curiosity, I would love to know which tag cloud formats best serve the purpose of discovering more and more (relevant) content.
I am aware of 3 formats, but don't know which one is the best.
1) delicious one - color shading
2) The standard one with font size variations -
3) The one on this site - numbers showing importance/usage.
So which ones do you prefer? and why?
Edit:
Thanks to the answers below, I now have much more understanding of tag cloud visualization techniques.
4) Parallel Tag Clouds - a simple use of parallel coordinates technique. I find it more organized and readable.
5) Voronoi diagram - more useful for identifying tag relationships and making decisions based on them. Doesn't serve our purpose of discovery of relevant content.
6) Mind maps - They are good and can be employed to step by step filter content.
I found some more interesting techniques here - http://www.cs.toronto.edu/~ccollins/research/index.html

I really do think that depends on the content of the information and the audience. What's relevant to one is not relevant to another. If an audience is more specialized, then they will be more likely to think along the same lines, but it would still need to be analyzed and catered to by the content provider.
There are also multiple paths that a person can take to "discover more". Take the tag "DNS" for example. You could drill down to more specific details like "UDP Port 53" and "MX Record", or you could go sideways with terms like "IP address" "Hostname" and "URL". A Voronoi diagram shows clusters, but wouldn't handle the case where general terms could be related to many concepts. Hostname mapping to "DNS", "HTTP", "SSH" etc.
I've noticed that in certain tag clouds there's usually one or two items that are vastly larger than the others. Those sorts of things could be served by a mind map, where one central concept has others radiating out from it.
For the cases of lots of "main topics" where a mind map is inappropriate, there are parallel coordinates but that would be baffling to many net users.
I think that if we found an extremely well organized way of sorting clusters of tags while preserving links between generalities and specificities, that would be somewhat helpful to AI research.
In terms of which I personally prefer, I think the numeric approach is nice because infrequently referenced tags are still presented at a readable font size. I also think SO does it this way because they have vastly more tags to cover than the average size based cloud a la the standard.

I would go with #2 out of the options you listed above.
1 - The human eye recognizes and comprehends size differences much more effectively than color, when the color scale is along the same spectrum (ie, various blues as opposed to discrete individual colors).
3 - Requires the user to scan the full list and mathematically compare each individual number while scanning. No real meaningful relationship between tags without a lot of work on the users part.
So, going with #2, there are several considerations to take into account:
Keep the tags alphabetical. This affords the user another method of searching and establishes a known relationship between each (assuming they know the alphabet!). If they're unordered, it's just a crapshoot to find a single one.
If size comparison is absolutely critical (this usually isn't the case, as you can scale up each level by a certain percentage or pixel amount), use a monospaced font. Otherwise, certain letter combinations may end up looking larger than they actually are.
Don't include any commas, pipes, or other dividers. You're already going to have a lot of data in a small area - no need to clutter it up with debris. Space the tags out with a decent amount of padding, of course. Just don't double the number of visual elements by adding more than just the data.
Set a min/max font size and scale between those. There are situations where one tag may be so popular that visually it may appear exponentially larger than the others. Likewise, you don't want a tag to end up rendering at 1px! Set the min/max and adjust between as necessary.
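A minimal Python sketch of the min/max scaling, combined with the alphabetical ordering suggested above. The function names and the 12px/32px defaults are arbitrary choices for illustration, and linear scaling is just one option (a very dominant tag may call for something like a logarithmic scale instead).

```python
from typing import Dict


def font_size(count: int, min_count: int, max_count: int,
              min_px: int = 12, max_px: int = 32) -> float:
    """Linearly map a tag's usage count onto the [min_px, max_px] range."""
    if max_count == min_count:
        return float(min_px)  # all tags equally popular: avoid dividing by zero
    ratio = (count - min_count) / (max_count - min_count)
    return min_px + ratio * (max_px - min_px)


def tag_sizes(counts: Dict[str, int]) -> Dict[str, int]:
    """Return tag -> pixel size, with tags in alphabetical order."""
    low, high = min(counts.values()), max(counts.values())
    return {tag: round(font_size(counts[tag], low, high))
            for tag in sorted(counts)}
```

However popular the top tag is, it never renders larger than `max_px`, and the rarest tag never drops below `min_px`.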

size adjusted Voronoi diagram
- it shows which tags are inter-related

My favorite tag cloud format is the Wordle format. It looks great and it also does a pretty good job of fitting a lot of tags in a small space.
