(human) Language of a document - nlp

Is there a way (a program, a library) to approximately know which language a document is written in?
I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal).
I don't need perfect matches, only some guess.

There is a pretty easy way to do this, given that you have corpus data in all the different languages you'll need to identify. It's called n-gram modeling. I think Lingua::Identify does this already, though, so that is your best bet rather than implementing your own.
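To give a rough idea of what n-gram modeling looks like, here is a minimal sketch. It is not how Lingua::Identify works internally; it is a simplified variant that counts how many of a document's character trigrams were also seen in a sample of each language. The sample sentences, `SAMPLES`, `PROFILES` and the function names are made up for illustration, and real use would need much larger corpora.

```python
from collections import Counter

# Toy training samples (illustrative only; real corpora should be far larger).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat est assis sur le tapis",
    "es": "el zorro marrón salta sobre el perro perezoso y el gato está sentado en la alfombra",
}


def trigrams(text, n=3):
    """Count character n-grams of `text`, lowercased, with normalized spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


# One profile per language: the set of trigrams seen in its sample.
PROFILES = {lang: set(trigrams(sample)) for lang, sample in SAMPLES.items()}


def guess_language(text, profiles=PROFILES):
    """Return the language whose profile covers most of the text's trigrams."""
    doc = trigrams(text)

    def score(lang):
        known = profiles[lang]
        return sum(count for gram, count in doc.items() if gram in known)

    return max(profiles, key=score)


print(guess_language("the lazy dog"))
print(guess_language("le chat est assis"))
```

Since only a guess is needed, scoring an excerpt of each document instead of the whole text should be enough and keeps the run over ~500K files cheap.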

I'd say your best bet is to look for key words - articles, that kind of thing - that are unique to the languages you're looking for. "Un" will show up in both Spanish and French, for example, but "une" is identifiably French whereas "unos", for example, is identifiably Spanish. Diacritics are useful too - you'll see "ñ" in Spanish and possibly Portuguese, "ç" in French and a few others... that kind of thing.
edit - Paul's solution is probably the best; looks like it uses methods like what I outlined, plus a few extra.
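For what it's worth, the keyword idea can be sketched in a few lines. The marker lists below are illustrative assumptions and far from complete: only "une", "unos", "ñ" and "ç" come from the text above, the rest are common function words added for the example, and `MARKERS` and `guess_by_markers` are hypothetical names.

```python
import re

# Words and characters treated as hints for each language (illustrative only).
MARKERS = {
    "fr": {"words": {"une", "les", "des"}, "chars": "ç"},
    "es": {"words": {"unos", "unas", "los", "las"}, "chars": "ñ"},
}


def guess_by_markers(text):
    """Return the language with the most marker hits, or None if there are none."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    scores = {}
    for lang, markers in MARKERS.items():
        word_hits = sum(1 for word in words if word in markers["words"])
        char_hits = sum(lowered.count(char) for char in markers["chars"])
        scores[lang] = word_hits + char_hits
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_by_markers("une leçon pour les enfants"))
print(guess_by_markers("unos niños en las calles"))
```

Ambiguous words like "un" are deliberately left out, since they would add to both scores without helping to separate the languages.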

By running a Google search for "determine language of document" I found many different sites that will help you. The third link on the first page eventually led me to a function in the Google Code API that is exactly what you need.

Google Translation API is cool, and has a REST interface. But I need to send it a LOT of BIG documents (yes, I could use an excerpt) and, even if Google is Google, I don't think this is fair.
The documents are also not mine, and I'd have to ask my client if it is OK to send them to a third party (even if, sooner or later, G will get them ;)).
I think I'll go down the Perl path...

There seems to be a Perl module for this: Lingua::Identify
Paul.

Related

language popularity figures (C++, C#, Java, PHP, flash script, etc.)

I need to find figures that show how many programmers worldwide have each of the following languages as their primary programming language.
C
C++
C#
Objective-C
Java
JavaScript
VB.NET
VB6 (or older)
VBA
PHP
flash scripts
Ruby
Does anyone know of such comparison figures?
If not, do you know of a good way to research this?
I could compare the number of tags here at Stack Overflow and the number of articles for each language at sites like CodeProject. This would give me a good idea.
But if you can suggest other ways to find these numbers, I will be grateful.
/Thomas
A very common site that does this is the TIOBE index. It basically searches for programming languages in major search engines and compares the results, and it shows you some history. The only problem is that C/C++/C# are not distinguished very well, therefore C is more dominant than you'd expect (not to mention that search results include many pages where many languages are listed, like programming FAQs). But in general, TIOBE gives a good idea, I think, and it should get better, since at least Google tends to know the difference between zero, two or four pluses.
Have you tried the TIOBE index?
In general this is hard to measure because every approach has a lot of drawbacks.
TIOBE and others that are based on search results, for example, tell you nothing about what is actually used, only what is ranked highly by Google (you can even see that Google changing its results a bit in 2004/2005 completely shook up TIOBE). Moreover, many search terms are ambiguous (Java is also an island, Ruby is also a gem, Python is a snake, and others have alternative meanings too). Another problem with search-based measures is that most things put on the web stay up forever, so the count says nothing about whether something is CURRENTLY interesting. If a C resource was put up in 2002, it is likely still available today (which hugely overrates leading or older languages).
Here is one interesting approach based on the number of book sales. (This at least eliminates the ambiguity problem, but comes with others.)
Wikipedia also has a small article about the topic.
Try Google Trends (see an example). In addition, check sites like freshmeat.net and note the number of projects in each language. That covers only open source projects, and many people will use a different language for their hobby projects than at work (i.e. one that sucks less).
Next, look for sites which offer job openings. I don't have a good link handy but this Google query should get you started.

Non-English domain naming issues in programming

Most programming code, I imagine, is written in English. But I'm curious how people are handling the issue of naming herein. A lot of programming is done within some business domain, usually with well-established terms for certain procedures and items.
I'm from Denmark, for instance, and something I work a lot with has a term called "indblikskode", which sort of translates to "insight code". So, do I use the line "string indblikskode = ..." in the C# code for some web service related to this? Or do I try to use a translation, such as "insightcode"? The business I'm in isn't even consistent in its language, for instance using the term "organisatorisk enhed" (organizational unit), but just as often using the abbreviation "OU", which is obviously abbreviated from the English.
How do other people handle this naming issue, while keeping consistent, and sane (in everything from simple variable names in your code, to database tables, to server names)?
Duplicates:
Should identifiers and comments be always in English or in the native language of the application and developers?
Do you use another language instead of English?
I can only speak for myself, but I always translate terms into English when naming classes and variables, and it's one of our unwritten best coding practices to do so as well. You never know when you might need to hand off development to cheaper labour abroad or the expert expat consultant in town.
The problem with non-English naming of classes and functions is that you are invariably going to end up with macaronic pidgin. Keywords are in English, naming conventions (for example getters/setters) are also in English, and the same goes for the standard names of design patterns.
You're going to end up with stuff like:
OrganisatoriskEnhedFactory::getInstance()->getIndblikskode();
See my question and answer here.
Basically it depends on your organization and the application. If your company, developers and customers all speak the same native language and you expect it to stay that way, then it would be extremely counter-productive to have everyone become a part-time translator as well. Considerable productivity loss for a purely hypothetical future advantage. YAGNI.
If it's a large international company, or if there are concrete plans to expand internationally or have some work done offshore, it's a different matter, of course.
Having worked in Switzerland (German side, i.e. Zurich) and lived in Germany for a time, I can tell you that I've yet to see an environment where the code isn't in English. Sure, the application may well be in German (but many professional environments are English-speaking anyway), but the code (I've seen) is pretty much all English.
It's hard to write code in other languages. For one thing, the APIs are (nearly) all in English. Java uses JavaBeans naming for example so you have to use set and get anyway and "getGeburtstag" just doesn't have quite the same ring to it as "getDateOfBirth".
Other countries may vary; this has been my experience in the Germanic countries.
We're usually using established English terms (our business domain usually has English terms), but if I can't figure out any suitable term, I could as well use Finnish. Heck, even our comments in code are in mixed languages...
Of course the sensible approach depends largely on whether the source code will ever be used outside the building. In a small shop it's not such a big deal.
I'm working in a company in Austria (so we're talking German) and we are programming in English (variable names, domain objects, GUIs). Makes it a bit more cumbersome, because you have to find the English translations and you have to translate the GUI before releasing the program. I'm not really sure if all the names are really correct.
In contrast, the former company I worked for programmed strictly in German. This was pretty nice (although German words tend to be longer than English words). After some years the company wanted to use the same program in the USA, so English-speaking programmers had to use the same codebase. After this everything got pretty inconsistent: variables, database fields... in both languages (the English-speaking team members didn't speak German).
My experience is that it is easier to handle internationalization at the very beginning of an application (you are forced to do it when you write the program in English), because it is no fun localizing a 10000 LOC application. The advantage of writing in another language is that you see instantly what is localized and what is not, although that is work you have to account for.
As for untranslatable words: we haven't experienced that yet, although it was some work finding the English phrase for "intra-community deliveries" (that's an EU thing). But if that happened, I'm pretty sure we would use the German word.
I live and work in Germany but write English code only. It makes things easier. You can post your code on the net if you want to ask questions or want to publish tutorials about your work.
Also the code looks more "professional" to me.
I also live and work in Germany for now, and we mostly use English except for some old comments in German. I think non-English comments are generally a very bad idea since you'll have to spend time trying to understand them (and understand them correctly). Although neither German nor English is my native language, code written in anything other than English seems bizarre.
You'll never know who would be working on your code the next day. So you should use the universal IT language.
P.S. Since I do not like non-English languages in my development environment, I made a local administrator quite angry when I refused that my PC be installed with German Windows, German Office and German Visual Studio. It took many hours to download the English versions just for me.
Though I think it is good one day to install a language pack or just a different copy of the same software just to learn the terminology. SQL Management Studio in French makes me really excited, just as when I tried to switch Skype to Spanish.

Naming conventions: Looking for alternative to mixing of English and domain/workflow terms

Though all people at our company are non-native English speakers, we try hard to write our documentation, code and comments in English, pretty much everything except user-related stuff, of course. This is OK as long as business terms are translatable and not too specific to the domain. But once the business terms get too specific, either there is no adequate translation for them or the translation just sounds silly and meaningless.
This leads to an awful language mix when writing code.
What is your experience on this topic? Do you avoid silly names in code by all means or do you just live with it?
I think that trying to keep everything in English when all developers and users have a common non-English native language is not only useless but actually harmful. Domain terms are just the most obvious example.
IMO all domain terms should stay in the native language, as should documentation and comments. This allows developers to concentrate on the code's logic rather than translation issues. It may look silly to have a mix of two languages in the code, even within single method names, but IMO it's not really a problem and better than making a great effort to have everything in English when nobody actually benefits from it.
Of course, this only applies in the described scenario. If you're a department of a large international company, or planning to expand your market internationally, or your native language has not very many speakers, then it's a different matter.
I've no experience with this as I am a native English speaker. However when the domain is complex, Extreme Programming suggests that you use a metaphor that programmers and customers are comfortable with.
This could be applied in your situation.
We have exactly the same problem in our company. We try to write the code and examples in English, but we often find unsolvable situations where direct common names in our native language have no direct or unknown translation or equivalence. So from my point of view, it is almost impossible to avoid mixing languages (will actually Spanglish be the language of the future despite my will?).
Basically it is a matter of two things:
Insufficient general English language knowledge.
Not everybody is perfectionist enough to spend some seconds to find out on Internet how to correctly say something, so what first comes to mind goes to code.
The effects of these are:
Comments cannot be understood by native English speakers (because the words used do not exist or the sentences are a literal translation from the source language).
Incorrect translations full of false friends or invented words. A typical example of this would be using the word "actual" to mean "current", since that is what "actual" means in Spanish.
The real solution is to attack the source of the problem:
English training courses (including applied technical English) should be a must in the company, so every worker can reach a minimum acceptable level of English knowledge.
Force the not-so-perfectionist people to follow some coding rules by defining a style guide for code or, even better, have a QA department that enforces the quality of code, grammar of comments and readability included.

Creating a Mobile Programming Language

I'm thinking about creating a small language that is very easy to type on a mobile phone (J2ME).
What is the most appropriate language to implement in order to run it inside a mobile phone (J2ME always)? Appropriate meaning small/easy syntax, easy to type on a mobile phone.
Is it Lisp? Some sort of Basic/Python/Ruby (I think not...)? Or a new one (can you propose a new syntax?)?
I am the author of just such a language: Hecl, at http://www.hecl.org . In order to make quick applications easier, I also created a site where you can build simple apps through a web interface: http://www.heclbuilder.com . I also wrote an article discussing the implementation of the language:
http://www.welton.it/articles/hecl_implementation
Other languages that are worth looking at include Lua, and Javascript, both of which have mobile implementations.
If you include editor support (nesting structures, indented display, balancing, ...) then some form of LISP would be relatively straightforward to implement and use. I've seen screenshots (but can't find them now) of a LISP-based language for live interactive-performance programming. It used indented, shaded rectangular areas on the screen (instead of parentheses) to show nesting of structure.
I would think the design of the editor would be the biggest consideration, not the language. For instance, supporting some kind of "intellisense"-like autocompletion would be vital for saving thumbstrokes. Some kind of language sensitivity in the editor would help a lot too. For instance, when a C user types "for" the autocomplete should show an option for filling out the syntax of a loop:
for (;;) {
}
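As a rough sketch of that kind of template autocompletion (the `TEMPLATES` table and `complete` function are hypothetical names, not part of any existing editor), the editor could keep a small map from keywords to skeletons and offer every entry matching what has been typed so far:

```python
# Keyword -> code skeleton the editor could insert (illustrative entries).
TEMPLATES = {
    "for": "for (;;) {\n}",
    "if": "if () {\n}",
    "while": "while () {\n}",
}


def complete(prefix):
    """Return (keyword, skeleton) pairs whose keyword starts with `prefix`."""
    return [
        (keyword, skeleton)
        for keyword, skeleton in sorted(TEMPLATES.items())
        if keyword.startswith(prefix)
    ]


for keyword, skeleton in complete("fo"):
    print(keyword)
    print(skeleton)
```

On a numeric keypad the matching could run after every key press, so that two or three thumbstrokes are enough to pick a whole loop skeleton.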
You might want to look into Hecl: http://www.hecl.org/
I'm not sure what's easy to type on a mobile phone, but the language I know with the most computing power per character is APL. As a source of syntactic or design ideas, you might prefer its modern successor, the J programming language.
On a mobile phone, you should also consider languages like Scratch (Smalltalk), because the non-typing interface would be easy to use.
It would also work well on smartphones with drag-and-drop capability.
On the other hand, the IDE would be a lot heavier on CPU & other resources.
Forth is usually considered a legitimate contender for these kinds of requirements. And it's about as terse as can be imagined. Extensible, small and malleable. Built-in small screen editor, too.
If you want super-compact, try nano-False http://www.aldweb.com/pages/winikoff/#false
It isn't very usable, although more so than the deliberately painful Brainfuck and Whitespace. Think of it as Forth with the easy syntax made more concise ;-)
I found Quartus Forth reasonably easy to use, provided you can think in stacks, and with more Intellisense support for the API it would have been much more productive. For prototyping little algorithms on the Palm I preferred Plua or LispMe. The LispMe environment is worth studying anyway because it made good use of lists for finding keywords and so eased GUI programming.
The big decision you have to make is whether you expect users to just use a phone numeric keypad or be able to type in reasonable approximations to a full keyboard. One of the huge benefits of the Palm was the high-quality full-size folding keyboards which I sadly miss (and hope someone makes an iPhone accessory to connect). If you don't have a full keyboard, make use of selectors for verbs so they can use picking actions rather than having to type in words. Consider the amount of code typed in traditional code for the framework classes and methods compared to the user code.
When I go about dreaming about a language, I think about what features are important to me at the time I'm dreaming. Only once you figure out what features are important to you can you come up with the best answer to what syntax. For example, if you want named parameters, it greatly influences your design choice about how method calls look (a la Objective-C or Python).
Designing a language can be a really fun task. I encourage you to step back and ask yourself "Do I really like how this is done in X?" (substituting some language name). If that's something you've always loved, steal it. If not, look elsewhere. Create your ultimate mashup of what you love, and leave out what you hate!
Lisp would be difficult to type because of all the ()s, although joel.neely's answer demonstrates one way of working around that problem.
So if you want to use an existing language you might want to look at which ones use the fewest unusual characters.
Then there's the screen size issue. The more verbose the language the less code you're going to be able to fit onto the screen at once. What kind of devices are you aiming at? Smartphones with big screens (a limited audience) or 240x240 pixel feature phones?
Bear in mind that the interpreter/VM for your language will have to fit into a small amount of memory and performance may not be very good.
Brainfuck has only 8 characters -- very easy to type in on a mobile phone.
Of course, understanding and doing stuff with it... not so easy. But it satisfies the requirement....
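To show how little an implementation needs, here is a minimal sketch of a Brainfuck interpreter. It is written in Python for readability rather than J2ME, and it assumes the usual conventions (30000 byte-sized cells that wrap around, end of input reads as 0), which the language does not strictly fix. `run_brainfuck` is just an illustrative name.

```python
def run_brainfuck(program, input_text=""):
    """Run a Brainfuck program and return its output as a string."""
    # Precompute the position of the matching bracket for every [ and ].
    jumps, stack = {}, []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start

    tape = [0] * 30000  # assumed tape size
    ptr = pc = 0
    inputs = iter(input_text)
    out = []
    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inputs, "\0")) % 256
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # go back to the loop start
        pc += 1  # any other character is a comment and is ignored
    return "".join(out)


# 8 * 8 + 1 = 65, the character code of "A".
print(run_brainfuck("++++++++[>++++++++<-]>+."))
```

The 24 keystrokes needed just to print one letter illustrate the trade-off: few distinct keys, but a lot of typing.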
Basic is very easy.
I would stay away from lisp. Unless you want to give your mobile users a headache on top of the headache they have from radio waves.

How do people choose product names?

I flatter myself that I'm a good programmer, and can get away with graphic design. But something I'm incapable of doing is coming up with good names - and it seems neither are the people I work with. We're now in the slightly ludicrous situation that the product we've been working on for a couple of years is being installed to customers, is well received and is making money - but doesn't yet have a name.
We're too small a company to have anything like a proper marketing division to do this thing. So how have people tended to choose names, logos and branding?
When it's for something that "matters", I plop down the $50 and have the folks at PickyDomains.com help out. That also results in a name that's available as a .com.
For guidelines, here's an extract from my own guide on naming open source projects:
If the name you're thinking of is directly pulled from a scifi or fantasy source, don't bother. These sources are WAY overrepresented as naming sources in software. Not only are your chances of coming up with something original pretty small, most of the names of characters and places in scifi are trademarked and you run the risk of being sued.
If the name you're thinking of comes straight from Greek, Roman or Norse mythology, try again. We've got more than enough mail related software called variations of "Mercury".
Run your proposed name through Google. The fewer results you get the better. If you get down to no results, you're there.
Don't try to get a unique name by just slightly misspelling something. Calling your new Windows filesystem program Phat32 is just going to end up with users getting frustrated looking at the results of "fat32" in a search engine.
If your name couldn't be said on TV in the 50s or 60s, you're probably on the wrong track. This is particularly true if you would like anyone to use your product in a work environment. No one is going to recommend a product to their co-workers if they can get sued for sexual harassment just for uttering its name.
If your product name can't be pronounced at all, you'll get no word of mouth benefit at all. Similarly, if no one knows how to pronounce it, they will not be very likely to try to say it out loud to ask questions about it, etc. How do YOU say MySQL? PostgreSQL? GNU? Almost all spoken languages on Earth are based on consonant/vowel syllables of some sort. Alternating between consonants and vowels is a pretty good way to ensure that someone can pronounce it.
The shorter the better.
See if the .com domain is available. If it's not, it's a pretty good indicator that someone has already thought of it and is using it or closer to using it than you are. Do this even if you don't intend to use the domain.
Don't build inherent limitations on your product into the name. Calling your product LinProduct or WinProduct precludes you from ever releasing any sort of cross-platform edition.
Don't use your own name for open source products. If the project lives on beyond your involvement, the project will either have to be renamed or your name may be used in ways you didn't intend.
for a product, first read Positioning, the Battle for Your Mind and think really hard about what mental position you want to occupy
then find a word or two that conveys that position, and make up an acronym for it
for a (self-serving) example: my most recent product is a fine-grained application monitor for .NET applications. I want to convey the feeling of peace that you have when you know that your apps are behaving because they are continuously monitored, so 'no news' really is 'good news'. I chose CALM after a lot of false starts, and decided that it stood for Common Application Lightweight Monitor - which just also happens to be a very technically accurate description of the basic implementation
also, you might be amazed at how much 'better' users perceive an application to be when it has a name and a logo attached to it.
You should try BustaName. It basically combines words to create available domain names. You are able to choose similar words for the words that you previously entered.
Also try these links out:
Naming a company
77 ways to come up with an idea
Igor Naming Guide (PDF)
Names -- you can try yourselves, or ask friends/customers what they think of when they hear about or use your product (I don't know the correct English word for that -- if two things have something in common they are associated?).
Or, depending on what kind of product it is, ask someone with unlimited imagination -- kids are very good at it.
Logos and branding -- you need professionals.
And of course you need a lawyer :).
I second the recommendation of the Igor naming guide. Stay away from meaningless strings of alternating vowels and consonants: altana, obito, temora, even if it seems easy and the domains are readily available. Pick something with soul and meaning. Best example: "Plan B" (also known as the morning-after pill).

Resources