GATE extract NE from documents in German - nlp

I need to extract people's names from documents in German (not my native language). After a bit of searching, I found the GATE framework, which seems to support English, German and many other languages. The accuracy for English is quite decent, but it's unacceptable for German (see screenshots).
[Screenshot: the PRs in use]
[Screenshot: a chunk of highlighted people names]
A friend of mine says that none of these is a person name, so I wonder if I misconfigured something. Do I need to specify the language somehow?

Solution: Install the "Language: German" plugin and use it. The accuracy is still poor, though, at least in my case.

Related

Bilingual Ubiquitous Language

One problem in a project with Domain Driven Design:
In discussions about the domain model, many terms of the Ubiquitous Language (UL) are used in German by the team members (all German speakers), whereas the English version is used within the analysis model and the code model.
What is good practice for handling this issue? Should we force ourselves to use the English terms in discussions as well, or is it OK to translate the terms for modeling and implementation?
I've also worked in multiple German DDD projects.
In my experience, the team usually starts to use the English terms automatically as soon as there have been first discussions and a first implementation. It helps if you maintain an up-to-date glossary of the German terms and their English counterparts, especially to help new team members.
I think UL is just another language, like German or English. The point is that it has to be spoken and understood by all team members. And it has to be used everywhere: discussions, documents, diagrams, source code, ...
You cannot use a German term in discussions and its English translation in source code.
Working in Germany, I was once involved in a banking project where the team (all Germans) decided to write the code in English, on the off chance that some offshore team might later be involved.
A couple of months later we were already struggling with our English dictionary. We noticed that we (and the business people) would have had no problems with the German words, but we ourselves did not always know the correct translations in the code. Some concepts (like legal terms, etc.) don't even always have a correct translation.
It got even worse. As some offshore colleagues were involved, they started asking for the German words behind some things in the code, because they could google the German words, but not always the English ones. (Might be caused by our bad translations to English...)
Having gone through all of that, I would take whatever the "source" language is into the code. If the source material (requirements, legal framework, business team, etc.) is in German, then German.
The only argument against using German is that it looks and sounds really funny and strange. Oh, and Unicode support is a must. Class "Überweisungsvorlage" vs. CashTransferTemplate.
I know of another Team which combined English verbs in methods with German concepts, like "createÜberweisungsvorlage", because German verbs are quirky and do not lend themselves to short and uniform conventions.
Summary: You decide, and good luck :)
Use whichever language your domain experts use. If they use several, ask them to choose the one that defines the domain best.
The best language can be different from one subdomain/Bounded Context to another.

How to detect language of smartphone user

For the production of a set of mobile phone/smartphone minisites, what technology do you recommend for automatically choosing the language of the site:
browser IP address
mobile browser language request header
any method related to device specifics or carrier specifics of a certain country
any other method
The languages that will be targeted are:
Vietnamese
German
Thai
Arabic
Spanish
Indonesian
Italian
Japanese
Chinese, both Traditional and Simplified
Korean
Russian
I understand the answers may vary per language, so feedback on all or any language would be greatly appreciated.
Anything other than what a user has specifically requested is a bad idea. So, for example, using geographical IP lookup is a terrible idea. People may live in a country where multiple languages are spoken, or may simply prefer the lingua franca English, and might find it extremely annoying when another language is forced upon them.
Out of the options you mentioned only the browser language request header sounds like something a user might actually configure on his own. All other options I suspect will produce an inferior experience for large portions of the target audience.
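As a rough sketch of the request-header approach (not taken from the answer above): the server can read the browser's Accept-Language header, order the listed languages by their q-values, and pick the first one the site supports. The function names, the supported-language tags, the alias table for Chinese and the English fallback are all assumptions made for illustration.

```python
# Hypothetical site configuration: tags for the languages listed in the question.
SUPPORTED = {"vi", "de", "th", "ar", "es", "id", "it", "ja",
             "zh-hant", "zh-hans", "ko", "ru"}

# Assumed mapping from common regional Chinese tags to script-based tags.
ALIASES = {"zh-tw": "zh-hant", "zh-hk": "zh-hant",
           "zh-cn": "zh-hans", "zh-sg": "zh-hans"}


def parse_accept_language(header):
    """Return (tag, quality) pairs from an Accept-Language header, best first."""
    prefs = []
    for part in header.split(","):
        tag, _, params = part.strip().partition(";")
        tag = tag.strip().lower()
        quality = 1.0  # a tag without an explicit q-value counts as q=1
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # treat a malformed q-value as "not acceptable"
        if tag and quality > 0:
            prefs.append((tag, quality))
    # sorted() is stable, so tags with equal quality keep their header order.
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_language(header, supported=SUPPORTED, default="en"):
    """Pick the first supported language the browser asked for, else the default."""
    for tag, _ in parse_accept_language(header):
        tag = ALIASES.get(tag, tag)
        if tag in supported:
            return tag
        primary = tag.split("-")[0]  # e.g. "de-at" falls back to "de"
        if primary in supported:
            return primary
    return default
```

With these assumptions, choose_language("de-AT,de;q=0.9,en;q=0.8") selects "de". The result is best treated as an initial default that the user can still override with a visible language switch.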

Non-English domain naming issues in programming

Most programming code, I imagine, is written in English. But I'm curious how people handle the issue of naming here. A lot of programming is done within some business domain, usually with well-established terms for certain procedures and items.
I'm from Denmark, for instance, and something I work with a lot has a term called "indblikskode", which sort of translates to "insight code". So, do I use the line "string indblikskode = ..." in the C# code for some web service related to this? Or do I try to use a translation, such as "insightcode"? The business I'm in isn't even consistent in its language, for instance using the term "organisatorisk enhed" (organizational unit), but just as often using the abbreviation "OU", which is obviously abbreviated from the English.
How do other people handle this naming issue, while keeping consistent, and sane (in everything from simple variable names in your code, to database tables, to server names)?
Duplicates:
Should identifiers and comments be always in English or in the native language of the application and developers?
Do you use another language instead of English?
I can only speak for myself, but I always translate terms into English when naming classes and variables, and it's one of our unwritten best coding practices to do so as well. You never know when you might need to hand off development to cheaper labour abroad or the expert expat consultant in town.
The problem with non-English naming of classes and functions is that you are invariably going to end up with macaronic pidgin. Keywords are in English, naming conventions (for example getters/setters) are also in English, and so are the standard names for design patterns.
You're going to end up with stuff like:
OrganisatoriskEnhedFactory::getInstance()->getIndblikskode();
See my question and answer here.
Basically it depends on your organization and the application. If your company, developers and customers all speak the same native language and you expect it to stay that way, then it would be extremely counter-productive to have everyone become a part-time translator as well. Considerable productivity loss for a purely hypothetical future advantage. YAGNI.
If it's a large international company, or if there are concrete plans to expand internationally or have some work done offshore, it's a different matter, of course.
Having worked in Switzerland (the German side, i.e. Zurich) and lived in Germany for a time, I can tell you that I've yet to see an environment where the code isn't in English. Sure, the application may well be in German (but many professional environments are English-speaking anyway), but the code I've seen is pretty much all English.
It's hard to write code in other languages. For one thing, the APIs are (nearly) all in English. Java uses JavaBeans naming for example so you have to use set and get anyway and "getGeburtstag" just doesn't have quite the same ring to it as "getDateOfBirth".
Other countries may vary, but this has been my experience in the Germanic countries.
We're usually using established English terms (our business domain usually has English terms), but if I can't figure out any suitable term, I could as well use Finnish. Heck, even our comments in code are in mixed languages...
Of course the sensible approach depends largely on whether the source code will ever be used outside the building. In a small shop it's not such a big deal.
I'm working in a company in Austria (so we're talking German) and we program in English (variable names, domain objects, GUIs). That makes it a bit more cumbersome, because you have to find the English translations and you have to translate the GUI before releasing the program. I'm not really sure all the names are correct.
In contrast, the former company I worked for programmed strictly in German. This was pretty nice (although German words tend to be longer than English words). After some years the company wanted to use the same program in the USA, so English-speaking programmers had to work on the same codebase. After this everything got pretty inconsistent: variables and database fields in both languages (the English-speaking team members didn't speak German).
My experience is that it is easier to handle internationalization at the very beginning of an application (you are forced to do it when you write the program in English), because it is no fun localizing a 10000 LOC application. The advantage of writing in another language is that you see instantly what is localized and what is not, although that is work you have to take into account.
As for untranslatable words: we haven't experienced that yet, although it took some work to find the English phrase for "intra-community deliveries" (that's an EU thing). But if that happened, I'm pretty sure we would use the German word.
I live and work in Germany but write English code only. It makes things easier. You can post your code on the net if you want to ask questions or want to publish tutorials about your work.
Also the code looks more "professional" to me.
I also live and work in Germany for now, and we mostly use English except for some old comments in German. I think non-English comments are generally a very bad idea, since you'll have to spend time trying to understand them (and understand them correctly). Although neither German nor English is my native language, code written in anything other than English seems bizarre to me.
You'll never know who would be working on your code the next day. So you should use the universal IT language.
P.S. Since I do not like non-English languages in my development environment, I made a local administrator quite angry when I refused to have German Windows, German Office and German Visual Studio installed on my PC. It took many hours to download the English versions just for me.
That said, I think it is good to install a language pack, or just a different copy of the same software, once in a while to learn the terminology. SQL Management Studio in French really excites me, just as it did when I tried switching Skype to Spanish.

Naming conventions: Looking for alternative to mixing of English and domain/workflow terms

Though everyone at our company is a non-native English speaker, we try hard to write our documentation, code and comments in English: pretty much everything except user-facing material, of course. This is OK as long as business terms are translatable and not too specific to the domain. But once the business terms get too specific, either there is no adequate translation for them or the translation just sounds silly and meaningless.
This leads to an awful language mix when writing code.
What is your experience on this topic? Do you avoid silly names in code by all means or do you just live with it?
I think that trying to keep everything in English when all developers and users have a common non-English native language is not only useless but actually harmful. Domain terms are just the most obvious example.
IMO all domain terms should stay in the native language, as should documentation and comments. This allows developers to concentrate on the code's logic rather than translation issues. It may look silly to have a mix of two languages in the code, even within single method names, but IMO it's not really a problem and better than making a great effort to have everything in English when nobody actually benefits from it.
Of course, this only applies in the described scenario. If you're a department of a large international company, or planning to expand your market internationally, or your native language has not very many speakers, then it's a different matter.
I've no experience with this as I am a native English speaker. However when the domain is complex, Extreme Programming suggests that you use a metaphor that programmers and customers are comfortable with.
This could be applied in your situation.
We have exactly the same problem in our company. We try to write the code and examples in English, but we often run into unsolvable situations where common names in our native language have no direct translation or equivalent, or none that we know of. So from my point of view, it is almost impossible to avoid mixing languages (will Spanglish actually be the language of the future, despite my wishes?).
Basically it is a matter of two things:
Insufficient general English language knowledge.
Not everybody is perfectionist enough to spend a few seconds finding out on the Internet how to say something correctly, so whatever first comes to mind goes into the code.
The effects of these are:
Comments cannot be understood by native English speakers (because the words used do not exist or the sentences are literal translations from the source language).
Incorrect translations full of false friends or invented words. A typical example is using the word "actual" to mean "current", since that is what "actual" means in Spanish.
The real solution is to attack the source of the problem:
English training courses (including applied technical English) should be a must in the company, so that every worker can reach a minimum acceptable level of English.
Force the not-so-perfectionist people to follow some coding rules by defining a code style guide or, even better, have a QA department that enforces code quality, including the grammar and readability of comments.

(human) Language of a document

Is there a way (a program, a library) to approximately know which language a document is written in?
I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal).
I don't need perfect matches, only a rough guess.
There is a pretty easy way to do this, given that you have corpus data in all the different languages you'll need to identify. It's called n-gram modeling. I think Lingua::Identify does this already, though, so that is your best bet rather than implementing your own.
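To illustrate the n-gram idea only (this is not how Lingua::Identify is implemented, and it is Python rather than Perl): build a character-trigram frequency profile per language from sample text, then pick the language whose profile is most similar to the document's. The tiny training sentences, the function names and the choice of cosine similarity are assumptions for this sketch; a real setup would train on a much larger corpus per language.

```python
import math
from collections import Counter

# Toy training data (an assumption); real profiles need far more text per language.
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house "
          "that the cat lives in with the other animals",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist "
          "das haus in dem die katze mit den anderen tieren lebt",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y esta es la "
          "casa en la que vive el gato con los otros animales",
}


def trigram_profile(text):
    """Count overlapping character trigrams, with padding spaces at both ends."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())  # missing grams count 0
    norm = (math.sqrt(sum(c * c for c in a.values()))
            * math.sqrt(sum(c * c for c in b.values())))
    return dot / norm if norm else 0.0


PROFILES = {lang: trigram_profile(text) for lang, text in TRAINING.items()}


def guess_language(text):
    """Return the best-matching language code, or None for empty input."""
    profile = trigram_profile(text)
    if not profile:
        return None
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Since only a rough guess is needed, running this on an excerpt of each document rather than the full text would probably be enough.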
I'd say your best bet is to look for key words - articles, that kind of thing - that are unique to the languages you're looking for. "Un" will show up in both Spanish and French, for example, but "une" is identifiably French whereas "unos", for example, is identifiably Spanish. Diacritics are useful too - you'll see "ñ" in Spanish and possibly Portuguese, "ç" in French and a few others... that kind of thing.
edit - Paul's solution is probably the best; looks like it uses methods like what I outlined, plus a few extra.
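The keyword-and-diacritic idea from this answer could look roughly like the following. Only "une", "unos", "ñ" and "ç" come from the answer; the other marker words, the scoring and the function name are assumptions, and a usable version would need much longer marker lists and more languages.

```python
import re

# Marker words and characters per language. Beyond "une"/"unos"/"ñ"/"ç",
# the entries are illustrative assumptions.
MARKERS = {
    "fr": {"words": {"une", "les", "est"}, "chars": "ç"},
    "es": {"words": {"unos", "los", "las"}, "chars": "ñ"},
}


def guess_by_markers(text):
    """Return the language with the most marker hits, or None if unclear."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    scores = {}
    for lang, markers in MARKERS.items():
        word_hits = sum(1 for word in words if word in markers["words"])
        char_hits = sum(lowered.count(char) for char in markers["chars"])
        scores[lang] = word_hits + char_hits
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    # A tie (including "no hits at all") is reported as "don't know".
    return best if top > runner_up else None
```

Returning None on a tie keeps the guess honest when a document contains no markers or markers from both languages.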
By running a Google search for "determine language of document" I found many different sites that will help you. The third link on the first page eventually led me to a function in the Google Code API that is exactly what you need.
The Google Translation API is cool, and has a REST interface. But I need to send it a LOT of BIG documents (yes, I could use an excerpt) and, even if Google is Google, I don't think this is fair.
The documents are also not mine, and I'd have to ask my client whether it is OK to send them to a third party (even if, sooner or later, G will get them ;)).
I think I'll go down the Perl path...
There seems to be a Perl module for this: Lingua::Identify
Paul.

Resources