Getting local language code from lat lon coordinates - locale

I would like to use the current lat/lon centre of a Leaflet map to detect the local language so that I can choose which Wikipedia version to query against. Currently I am just querying the English version, but that limits the results. For example, in Tallinn:
English - 80 results - https://en.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=1000&gslimit=500&gscoord=59.436682840436205%7C24.747337102890015
Estonian - 109 results - https://et.wikipedia.org/w/api.php?action=query&list=geosearch&gsradius=1000&gslimit=500&gscoord=59.436682840436205%7C24.747337102890015
Also, the place names are returned in the local language, which is a good thing for me.
I appreciate that sometimes there may be ambiguity, and other times maybe I'll get a language that there isn't a Wikipedia for (list at http://meta.wikimedia.org/wiki/List_of_Wikipedias#All_Wikipedias_ordered_by_number_of_articles), but I can code for that and default to English as a fallback.
By way of background:
I have a prototype app at http://postcodepast.com which aims to allow discovery of cultural heritage content local to where people search (their home town for example). It allows them to confirm locations (i.e. to geotag them), possibly improve the accuracy, or simply reject a location when it is a false match. All the crowdsourced data generated from this will be released as open data (currently a GeoJSON feed).
The way it does this is to look up local landmarks from Wikipedia and then search for these names across cultural heritage providers like Europeana, DPLA and Flickr Commons. I could equally use other sources of landmarks (e.g. a list of geotagged WW1 battlefields) or other providers (with wider geographical coverage, or maybe specific content). Hopefully you get the idea.
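To make the "choose a Wikipedia version, fall back to English" idea concrete, here is a minimal Python sketch. It assumes the country code for the map centre has already been obtained from some reverse-geocoding step, which is not shown. The `COUNTRY_TO_WIKI_LANG` table and both function names are hypothetical; only the URL shape is taken from the two example queries above.

```python
from urllib.parse import urlencode

# Hypothetical, hand-maintained mapping from ISO 3166 country code to a
# Wikipedia language subdomain. A real app would need a much fuller table
# and a policy for multilingual countries.
COUNTRY_TO_WIKI_LANG = {"EE": "et", "FR": "fr", "DE": "de"}


def wiki_lang_for_country(country_code, default="en"):
    """Return the Wikipedia language code for a country, else the default."""
    return COUNTRY_TO_WIKI_LANG.get(country_code.upper(), default)


def geosearch_url(lang, lat, lon, radius=1000, limit=500):
    """Build a geosearch URL shaped like the example queries in the question."""
    params = {
        "action": "query",
        "list": "geosearch",
        "gsradius": radius,
        "gslimit": limit,
        "gscoord": f"{lat}|{lon}",  # urlencode turns "|" into %7C
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"


# Estonia maps to "et"; an unknown country falls back to English.
url = geosearch_url(wiki_lang_for_country("EE"), 59.4367, 24.7473)
fallback_url = geosearch_url(wiki_lang_for_country("ZZ"), 59.4367, 24.7473)
```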

Localisation practices, how to handle difference between language and region

I'm developing a website that will require different content for different regions.
Different regions also have a preferred language (defaulting to English), but may have multiple.
For example, Taiwan and Hong Kong pages have different content, despite having the same preferred language (Traditional Chinese). Each region may have targeted content, but a lot of the content would overlap with each other and other regions. Furthermore, Hong Kong would also want Hong Kong content to be able to display in English.
As someone new to localisation, I'd like to know: do existing l10n libraries typically handle these kinds of cases and demarcate between region and language? Would you have to copy the language-specific content multiple times for each region? Or would you just create one language file (or however the language strings are stored) for each language, and have different regions pull relevant content from the same language files depending on what that region needs?
I am likely going to be using a CMS (currently thinking Silverstripe), but I haven't decided yet as I am still figuring out the requirements (including localisation).
Thanks!
It's an interesting, but very broad question.
I would draw a distinction here between content (different pages, sections, menus..) and translations (different versions of the content).
In the sense that i18n libraries display translations within a common template - the locale (language+region) tends to be one and the same thing. Taking your example, you described three locales: zh-HK, en-HK and zh-TW. (Possibly en-TW too, although that wasn't clear).
The question is whether these locales are merely translations - Is it enough to simply create three versions of the same bits? If so, a common approach to the overlap factor is fallback. i.e. You might allow your zh-TW locale to fall back to the zh-HK in cases where they can be identical.
If that approach works, then check whether your chosen CMS supports per-language fallback (as opposed to a single global fallback to English, as is common).
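As a rough illustration of per-locale fallback (as opposed to one global fallback), here is a small Python sketch. The `FALLBACKS` and `CONTENT` tables and the `lookup` function are hypothetical and do not correspond to any particular CMS.

```python
# Hypothetical per-locale fallback configuration: zh-TW falls back to zh-HK,
# and the Hong Kong locales have no further fallback.
FALLBACKS = {"zh-TW": "zh-HK", "zh-HK": None, "en-HK": None}

# Hypothetical content store: zh-TW only defines what differs from zh-HK.
CONTENT = {
    "zh-HK": {"welcome": "歡迎", "region": "香港"},
    "zh-TW": {"region": "台灣"},
    "en-HK": {"welcome": "Welcome", "region": "Hong Kong"},
}


def lookup(locale, key):
    """Walk the fallback chain until some locale defines the key."""
    seen = set()
    while locale is not None and locale not in seen:
        seen.add(locale)  # guards against accidental cycles in FALLBACKS
        value = CONTENT.get(locale, {}).get(key)
        if value is not None:
            return value
        locale = FALLBACKS.get(locale)
    return None
```

With these tables, `lookup("zh-TW", "region")` uses the Taiwan-specific string, while `lookup("zh-TW", "welcome")` is served from zh-HK.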
However, if the content differs wildly between regions (totally different pages, menus etc..) then I would say it's typical to run separate instances of your CMS. hk.example.com and tw.example.com both available in whatever language translations you see fit. This will probably prevent sharing of content when it overlaps. That's the case in every CMS I've worked with but perhaps someone else can tell you differently.

Search feature on website

I am interested in implementing a search feature on a website. It is a location search, so address/state/zip should all work. It should then show results in that area and allow them to be filtered.
My question is:
What's the best approach for something like this?
There are literally dozens of ways of doing this (if not more). The exact implementation would depend on the technology stack that you use, but as a very top level overview:
you'd need to store the things you are searching for somewhere, and tag them with a lat/long location. Often, this would be in a database of some kind.
using a programming language, you would need to write a search that accepts a postcode, translates that to a lat/long and then searches the things in your database based on the distance between the location of the thing, and the location entered in the search.
if you want to support filtering, your search would need to support that too. This is often called "faceting" the search.
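The three steps above can be sketched in a few lines of Python. This is only an in-memory illustration: it assumes the postcode has already been translated to a lat/long, uses the haversine formula for distance, and the item fields (`name`, `lat`, `lon`, `category`) and function names are made up for the example. A real system would push the distance filter into the database or search engine.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def search_nearby(items, lat, lon, radius_km, category=None):
    """Return items within radius_km of (lat, lon), nearest first.

    If category is given, only items in that category are kept (a very
    simple form of filtering).
    """
    hits = []
    for item in items:
        if category is not None and item["category"] != category:
            continue
        distance = haversine_km(lat, lon, item["lat"], item["lon"])
        if distance <= radius_km:
            hits.append((distance, item))
    hits.sort(key=lambda pair: pair[0])
    return [item for _, item in hits]


# Hypothetical sample data.
ITEMS = [
    {"name": "Dublin shop", "lat": 53.3498, "lon": -6.2603, "category": "shop"},
    {"name": "Cork cafe", "lat": 51.8985, "lon": -8.4756, "category": "cafe"},
]
near_dublin = search_nearby(ITEMS, 53.35, -6.26, radius_km=50)
```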
Working out the lat/long for an address or postcode needs a geocoding service. Some, such as PostCode Anywhere, do this as a paid service; others, such as the Google Maps APIs, are free (within reason).
There are probably some hosted services that will do what you want, you'd have to shop around.
Examples of search software that supports geolocation searching out of the box are things like Solr, Azure Search, Lucene and Elastic.

How to embed basic weather report for current time for fixed location in web page?

What I need:
I need to output a basic weather report based on the current time and a fixed location (a county in the Republic of Ireland).
Output requirements:
Ideally plain text accompanied by a single graphical icon (e.g. sun behind a cloud etc.).
Option to style output.
No adverts; no logos.
Free of charge.
Numeric Celsius temperature and short textual description.
I appreciate that my expectations are high, so interpret the list more as a "wish-list" than as delusional demands.
What I've tried:
http://www.weather-forecast.com - The parameters for the iframe aren't configurable enough. Output is too bloated.
Google Weather API - I've played with PHP solutions to no avail though in any case, apparently the API is dead: http://thenextweb.com/google/2012/08/28/did-google-just-quietly-kill-private-weather-api/
My question:
Can anyone offer suggestions on how to embed a simple daily weather report based on a fixed location with minimal bloat?
Take a look at http://www.zazar.net/developers/jquery/zweatherfeed/
It's pretty configurable, although I'm not sure if there is still too much info for your needs. I've only tried it with US locations; all you need is a zipcode. The examples show using locations from other countries. I'm assuming it's a similar setup to get locations added for Ireland.

what algorithm does freebase use to match by name?

I'm trying to build a local version of the Freebase search API using their quad dumps. What algorithm do they use to match names? As an example, if you go to freebase.com and type in "Hiking" you get
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which includes things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for things by playing with it for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
They probably use an inverted index over selected fields, such as the English name, aliases and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
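To show what an inverted index over names looks like in its simplest form, here is a toy Python sketch. It is not how Freebase or Lucene actually implement it (there is no ranking, boosting or alias handling), and the function names are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(names):
    """Map each token to the set of positions of the names containing it."""
    index = defaultdict(set)
    for doc_id, name in enumerate(names):
        for token in tokenize(name):
            index[token].add(doc_id)
    return index


def search_names(index, names, query):
    """Return the names that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    matching = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(names[i] for i in matching)


# The example results quoted in the question.
NAMES = [
    "Apo Hiking Society",
    "Hiking",
    "Hiking Georgia",
    "Hiking Virginia's national forests",
    "Hiking trail",
]
INDEX = build_index(NAMES)
```

Here `search_names(INDEX, NAMES, "hiking")` returns all five names, while `search_names(INDEX, NAMES, "hiking trail")` narrows the result to the one name containing both tokens.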
Most likely it's a trie with lexicographical order.
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt etc. You might also want to look into edit distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
An implementation of such algorithms is the Simmetrics library by the University of Sheffield.
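For reference, Levenshtein edit distance is short enough to sketch directly. The version below is the standard dynamic-programming formulation written in plain Python; it is not taken from Simmetrics, and the function name is just illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the first i-1 chars of a and
    # the first j chars of b; only one previous row is kept at a time.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]
```

A small distance such as `levenshtein("Hiking", "Hikng") == 1` is the kind of signal you could use to tolerate typos when matching names.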

Where can I find a good introduction to locales

I have to write some code working with locales. Is there a good introduction to the subject to get me started?
First posted at Everything you need to know about Locales
A long time ago when I was a senior developer in the Windows group at Microsoft, I was sent to the Far East to help get the F.E. version of Windows 3.1 shipped. That was my introduction to localizing software – basically being pushed in to the deep end of the pool and told to learn how to swim. This is where I learned that localization is a lot more than translation.
Note: One interesting thing we hit - the infamous Blue Screen of Death switched the screen into text mode. You can't display Asian languages in text mode. So we (and by we I mean me) came up with a system where we put the screen in VGA mode, stored the 12 pt. courier bitmap at the resolution for just the characters used in BSoD messages, and rendered it that way. You kids today have it so easy :).
So keep in mind that taking locale into account can lead to some very unexpected work.
The Locale
Ok, so forward to today. What is a locale and what do you need to know? A locale is fundamentally the language and country a program is running under. (There can also be a variant added to the country but use of this is extremely rare.) The locale is this pair, and any combination of the two parts is valid. For example a Spanish national in Germany would set es_DE so that their user interface is in Spanish (es) but their country settings are those of Germany (DE). Do not assume location based on language or vice-versa.
The language part of the locale is very simple - that's what language you want to display the text in your app in. If the user is a Spanish speaker, you want to display all text in Spanish. But what dialect of Spanish - it is quite different between Spain and Mexico (just as in America we spell color while in England it's colour). So the country can impact the language used, depending on the combination.
All languages that support locale specific resources (which is pretty much all of them today) use a fall-back system. They will first look for a resource for the language_country combination. While es_DE has probably never been done, there often is an es_MX and es_ES. So for a locale set to es_MX it will first look for the es_MX resource. If that is not found, it then looks for the es resource. This is the resource for that language, but not specific to any country. Generally this is copied from the largest country (economically) for that language. If that is not found, it then goes to the "general" resource which is almost always the native language the program was written in.
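The lookup order described above can be sketched in Python as follows. The `RESOURCES` table, its keys and both function names are hypothetical; real run-times (Java resource bundles, .NET satellite assemblies and so on) do this for you.

```python
# Hypothetical resource tables. "" stands for the "general" resource,
# here in the language the program was written in.
RESOURCES = {
    "": {"greeting": "Hello", "farewell": "Goodbye"},
    "es": {"greeting": "Hola"},
    "es_MX": {},  # nothing Mexico-specific has been translated yet
}


def fallback_chain(locale):
    """'es_MX' -> ['es_MX', 'es', '']: most specific resource first."""
    parts = locale.split("_")
    chain = ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("")  # the general resource is always tried last
    return chain


def get_resource(locale, key):
    """Return the first definition of key found along the fallback chain."""
    for name in fallback_chain(locale):
        table = RESOURCES.get(name, {})
        if key in table:
            return table[key]
    raise KeyError(key)
```

With these tables, `get_resource("es_MX", "greeting")` comes from the es resource, and `get_resource("es_MX", "farewell")` falls all the way through to the general (English) resource.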
The theory behind this fallback is you only have to define different resources for the more specific resources - and that is very useful. But even more importantly, when new parts of the UI are made and you want to ship beta copies, or you release before you can get everything translated, the translated parts are localized and the untranslated parts still display - but in English. This annoys the snot out of users in other countries, but it does get them the program sooner. (Note: We use Sisulizer for translating our resources - good product.)
The second half is the country. This is used primarily for number and date/time settings. This spans the gamut from what the decimal and thousand separator symbols are (12,345.67 in the U.S. is 12 345,67 in Russia) to what calendar is in use. The way to handle this is by using the run-time classes available for all operations on these elements when interacting with a user. Classes exist for both parsing user entered values as well as displaying them.
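To illustrate the separator difference, here is a deliberately hand-rolled Python sketch. In real code you would use the run-time's locale-aware classes, as the text says; the `SEPARATORS` table and `format_number` function are hypothetical and cover only the two examples given above.

```python
# Hypothetical table of (thousands separator, decimal separator) per locale.
# Only the two examples from the text are included.
SEPARATORS = {
    "en_US": (",", "."),
    "ru_RU": (" ", ","),
}


def format_number(value, locale):
    """Format value with two decimals using the locale's separators."""
    group_sep, decimal_sep = SEPARATORS[locale]
    us_style = f"{value:,.2f}"  # e.g. "12,345.67"
    # str.translate swaps both characters in a single pass, so the new
    # decimal comma is never mistaken for a thousands separator.
    return us_style.translate(str.maketrans({",": group_sep, ".": decimal_sep}))
```

So the same value, `12345.67`, is displayed as `12,345.67` for en_US and `12 345,67` for ru_RU.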
Keep a clear distinction between values the user enters or sees and values stored internally as data. A number in an XML file is stored as a string, but that string should be "12345.67" regardless of locale (unless someone did something very stupid). Keep your data strongly typed and only do the locale specific conversions when displaying or parsing text to/from the user. Storing data in a locale specific format will bite you in the ass sooner or later.
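A small Python sketch of that boundary: locale-specific text is converted to a typed value as soon as it comes in, and the stored form is locale-independent. The function names are illustrative, and the separators are passed in explicitly as an assumption, standing in for whatever your run-time's locale classes would supply.

```python
from decimal import Decimal


def parse_user_number(text, group_sep, decimal_sep):
    """Turn locale-formatted user input into a typed Decimal."""
    # Treat non-breaking spaces like ordinary spaces before stripping groups.
    cleaned = text.replace("\u00a0", " ").replace(group_sep, "")
    cleaned = cleaned.replace(decimal_sep, ".")
    return Decimal(cleaned)


def to_storage(value):
    """Serialise for storage: always '.' as decimal point, no grouping."""
    return str(value)


# Input typed by a Russian and a US user ends up stored identically.
from_ru = parse_user_number("12 345,67", group_sep=" ", decimal_sep=",")
from_us = parse_user_number("12,345.67", group_sep=",", decimal_sep=".")
```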
Chinese
Chinese does not have an alphabet but instead has a set of glyphs. The People's Republic of China several decades ago significantly revised how to draw the glyphs and this is called simplified. The Chinese glyphs used elsewhere continued with the original and that is called traditional. It is the exact same set of characters, but they are drawn differently. It is akin to our having both a text A and a script A - they both mean the same thing but are drawn quite differently.
This is more of a font issue than a translation issue, except that wording and usage has diverged a bit, in part due to the differences in approach between traditional and simplified Chinese. The end result is that you generally do want to have two Chinese language resources, one zh_CN (PRC) and one zh_TW (Taiwan). As to which should be the zh resource - that is a major geopolitical question and you're on your own (but keep in mind PRC has nukes - and you don't).
Strings with substituted values
So you need to display a message, and you write Display("The operation had the error: " + msg); No, no, no! In another language the proper usage could be Display("The error: " + msg + " was caused by the operation"); Every modern run-time library has a construct where you can have a string resource "The operation had the error: {0}" and will then substitute in your msg at {0}. (Some use a syntax other than {0}, {1}, …)
You store these strings in a resource file that can be localized. Then when you need to display the message, you load it from the resources, substitute in the variables, and display it. The combination of this, plus the number & date/time formatters make it easy to build up these strings. And once you get used to them, you'll find it easier than the old approach. (If you are using Visual Studio - download and install ResourceRefactoringTool to make this trivial.)
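In Python the same idea looks like this, using `str.format` and its `{0}` placeholders. The `MESSAGES` table is a stand-in for a real resource file, and the `"xx"` locale is a made-up placeholder that reuses the reordered English sentence from above to show the placeholder moving.

```python
# Stand-in for localized resource files. "xx" is a made-up locale whose
# word order puts the substituted value in the middle of the sentence.
MESSAGES = {
    "en": {"op_error": "The operation had the error: {0}"},
    "xx": {"op_error": "The error: {0} was caused by the operation"},
}


def render(locale, key, *args):
    """Load the template for the locale (English if missing) and fill it in."""
    template = MESSAGES.get(locale, {}).get(key) or MESSAGES["en"][key]
    return template.format(*args)
```

The calling code stays the same for every language, for example `render("xx", "op_error", "disk full")`; only the template in the resource decides where the value lands.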
Arabic, Hebrew, and complex scripts.
Arabic & Hebrew are called bi-directional because parts of the text run right to left while other parts run left to right. Text in Arabic/Hebrew is written and read right to left. But when you get to Latin text or numbers, you then jump to the left-most part and read that left to right, then jump back to where that started and read right to left again. And then there is punctuation and other non-letter characters where the rules depend on where they are used.
Here's the bottom line - it is incredibly complex and there is no way you are going to learn how it works unless you take this on as a full-time job. But not to worry, again the run-time libraries for most languages have classes to handle this. The key to this is the text for a line is stored in the order you read the characters. So in the computer memory it is in left to right order for the order you would read (not display) the characters. In this way everything works normally except when you display the text and when you work out caret movement.
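You can see the "stored in reading order" point from Python with the standard `unicodedata` module, which reports each character's bidirectional class. The sample string and the `bidi_classes` helper are just for illustration; actual display reordering is left to the rendering library.

```python
import unicodedata


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character, in the
    order the characters are stored (i.e. reading order)."""
    return [unicodedata.bidirectional(ch) for ch in text]


# A Hebrew word, then Latin letters, then digits, stored in reading order.
sample = "\u05e9\u05dc\u05d5\u05dd abc 123"
classes = bidi_classes(sample)
# "R" = right-to-left letter, "L" = left-to-right letter,
# "EN" = European number, "WS" = whitespace.
```

Indexing, slicing and searching all operate on this stored order; only the renderer and the caret logic need to care about the visual order.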
Complex scripts like Indic scripts have a different problem. While they are read left to right, you can have cases where some combinations of letters are placed one above the other, so the string is no wider on the screen when the second letter is added. This tends to require a bit of care with caret movement but nothing more.
We even have cases like this in English where ae is sometimes rendered as a single æ character. (When the human race invented languages, they were not thinking computer friendly.)
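The stacked-letter case can be seen from Python with the standard `unicodedata` module: a Devanagari consonant followed by a vowel sign is two stored characters but occupies one position on screen. The helper below is a rough approximation that simply skips Unicode mark characters; it is not full grapheme cluster segmentation, and its name is made up for the example.

```python
import unicodedata


def count_base_characters(text):
    """Count characters that are not combining marks (categories Mn, Mc, Me).

    This approximates the number of caret positions; a real implementation
    would use proper grapheme cluster rules.
    """
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# Devanagari KA followed by VOWEL SIGN U, which is drawn below the KA.
stacked = "\u0915\u0941"
stored_length = len(stacked)                  # two code points in memory
visible_units = count_base_characters(stacked)  # one base character
```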
Don't Over-Stress it
It seems like a lot but it's actually quite simple. In most cases you need to display text based on the closest resource you have. And you use the number & date/time classes for all locales, including your native one. No matter where you live, most computer users are in another country speaking another language - so localizing well significantly increases your potential market.
And if you're a small company, consider offering a free copy for people who translate your product. When I created Page 2 Stage I offered a free copy (list price $79.95) for translating it - and got 28 translations. I also met some very nice people online in the process. For an enterprise level product, many times a VAR in another country will translate it for you at a reduced rate or even free if they see a good market potential. But in these cases, do the first translation in-house to get the kinks worked out.
One resource I find very useful is the Microsoft Language Portal where you can put in text in English and if that text is in any of the Microsoft products, it will give you the translation Microsoft used for a given language. This can give you a fast high-quality translation for up to 80% of your program in many cases.
Удачи! (Good Luck)

Resources