How can I determine the language of a web page, like Chrome does? - nlp

I am trying to build a corpus for a certain language. But when I fetch a web page, how can I determine what language it is in?
Chrome can do it, but what's the principle?
I can come up with some ad-hoc methods, like an educated guess based on the character set, IP address, HTML tags, etc. But is there a more formal method?

I suppose the common method is looking at things like letter frequencies, common letter sequences and words, character sets (as you describe)... there are lots of different ways. An easy one would be to just get a bunch of dictionary files for various languages and test which one gets the most hits from the page, then offer, say, the next three as alternatives.
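As a rough sketch of that dictionary-hit idea in Python (the tiny word sets below are placeholders standing in for real dictionary files):

import re

# Placeholder word lists; in practice these would be loaded from
# full dictionary files for each language.
DICTIONARIES = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "es": {"el", "la", "de", "que", "y", "en"},
    "de": {"der", "die", "und", "ist", "von", "das"},
}

def rank_languages(text):
    words = re.findall(r"\w+", text.lower())
    hits = {lang: sum(w in vocab for w in words)
            for lang, vocab in DICTIONARIES.items()}
    # Languages sorted by number of dictionary hits, best first
    return sorted(hits, key=hits.get, reverse=True)

print(rank_languages("El perro y el gato están en la casa"))   # 'es' first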

If you are just interested in collecting corpora of different languages, you can look at country specific pages. For example, <website>.es is likely to be in Spanish, and <website>.de is likely to be in German.
Also, Wikipedia is translated into many languages. It is not hard to write a scraper for a particular language.
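For instance, a small Python sketch for pulling random article summaries in one language via the Wikimedia REST API (endpoint name as I recall it; check the current API docs and mind the rate limits before using it seriously):

import requests

def sample_wikipedia(lang="de", n=3):
    # Random-article summary endpoint of the Wikimedia REST API
    url = f"https://{lang}.wikipedia.org/api/rest_v1/page/random/summary"
    texts = []
    for _ in range(n):
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        texts.append(resp.json().get("extract", ""))
    return texts

for extract in sample_wikipedia("de"):
    print(extract[:80])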

The model that determines a web page's language in Chrome is called the Compact Language Detector v3 (CLD3), and it's open-source C++ code (sort of, since it's not reproducible). There are also official Python bindings for it:
pip install gcld3
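Usage looks roughly like this (API names as documented for the gcld3 package; verify against the version you install):

import gcld3

detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
result = detector.FindLanguage(text="Dies ist ein kurzer deutscher Satz.")
print(result.language, result.is_reliable, result.probability)
# e.g. "de True 0.99..."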

Related

Class/structure members in a dynamic programming language interpreter

For statically typed languages, member access is easy: you just calculate the offset of the member at compile time. But how do Ruby or Python do it? There may be several structures with the same member names, and you are not even sure what kind of object a variable holds. Do they use some kind of dictionary to look up the member at runtime?
If you were implementing a dynamically typed language from scratch, that's probably where you would start - with something like a hashtable based dictionary, and it's a perfectly ok solution.
Some dynamic language runtimes optimized for size rather than performance (such as Jerryscript, a highly size-optimized Javascript interpreter) use this sort of approach exclusively, and it works fine.
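To make the idea concrete, here is a toy object model in Python where every member access is a runtime hash lookup. The class and method names are invented for illustration; CPython's per-instance __dict__ works along broadly similar lines.

class ToyObject:
    """Every object carries its own member dictionary."""
    def __init__(self):
        self.slots = {}                 # member name -> value

    def set_member(self, name, value):
        self.slots[name] = value        # store under the hashed name

    def get_member(self, name):
        try:
            return self.slots[name]     # runtime hash lookup
        except KeyError:
            raise AttributeError(f"no member named {name!r}")

point = ToyObject()
point.set_member("x", 3)
point.set_member("y", 4)
print(point.get_member("x"))            # 3, resolved at runtime by hashing "x"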
However, most modern JIT-based dynamic language runtimes such as the V8 Javascript engine (used in Chrome and Node.js) or the JSC Javascript engine used in Safari only use this sort of dictionary as a fallback if they can't do anything better.
Here's an answer to another question where I described how V8 maps work and how they make property access very efficient. It contains a link to a much more detailed description, and a video by Lars Bak who is one of the lead engineers for V8, which is good if you're interested in how this stuff works.
With Ruby (as with Javascript), there are a number of different implementations (JRuby, MRI, Rubinius, etc), so to answer the question "How does Ruby do it" is difficult - each implementation will do it in a different way (although there will be many similarities, imposed by the language design).
Since you seemed to be asking about the concepts in dynamic languages in general, hopefully you'll find the link above gives you some useful information about possible implementations.

unicode characters

In my application I have Unicode strings and I need to tell which language each string is in.
I want to do this by narrowing down the list of possible languages based on which Unicode ranges the string's characters fall into.
The ranges I have are from http://jrgraphix.net/research/unicode_blocks.php
and the possible languages from http://unicode-table.com/en/
The problem is that the algorithm has to detect all languages. Does anyone know of a wider mapping of Unicode ranges to languages?
Thanks
Wojciech
This is not really possible, for a couple of reasons:
Many languages share the same writing system. Look at English and Dutch, for example. Both use the Basic Latin alphabet. By only looking at the range of code points, you simply cannot distinguish between them.
Some languages use more characters, but there is no guarantee that a specific piece of text contains them. German, for example, uses the Basic Latin alphabet plus "ä", "ö", "ü" and "ß". While these letters are not particularly rare, you can easily create whole sentences without them. So, a short text might not contain them. Thus, again, looking at code points alone is not enough.
Text is not always "pure". An English text may contain French letters because of a French loanword (e.g. "déjà vu"). Or it may contain foreign words, because the text is talking about foreign things (e.g. "Götterdämmerung is an opera by Richard Wagner...", or "The Great Wall of China (万里长城) is..."). Looking at code points alone would be misleading.
To sum up, no, you cannot reliably map code point ranges to languages.
What you could do: Count how often each character appears in the text and heuristically compare with statistics about known languages. Or analyse word structures, e.g. with Markov chains. Or search for the words in dictionaries (taking inflection, composition etc. into account). Or a combination of these.
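A toy version of the character-frequency idea in Python (the reference frequencies below are made-up placeholders; real ones would be measured from corpora of each language):

from collections import Counter

# Placeholder letter-frequency profiles for a couple of languages.
REFERENCE = {
    "en": {"e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "n": 0.067},
    "de": {"e": 0.174, "n": 0.098, "i": 0.076, "s": 0.073, "r": 0.070},
}

def letter_frequencies(text):
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters) or 1
    return {c: n / total for c, n in Counter(letters).items()}

def guess(text):
    freqs = letter_frequencies(text)
    def distance(ref):
        return sum(abs(freqs.get(c, 0.0) - p) for c, p in ref.items())
    return min(REFERENCE, key=lambda lang: distance(REFERENCE[lang]))

print(guess("Der schnelle braune Fuchs springt über den faulen Hund"))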
But this is hard and a lot of work. You should rather use an existing solution, such as those recommended by deceze and Esailija.
I like the suggestion of using something like Google Translate, as they will be doing all the work for you.
You might be able to build a rule-based system that gets you part of the way there. Build heuristic rules for languages and see if that is sufficient. Certain Tibetan characters do indicate Tibetan, and there are unique characters in many languages that will be a give away. But as the other answer pointed out, a limited sample of text may not be that accurate, as you may not have a clear indicator.
Languages will however differ in the frequencies that each character appears, so you could have a basic fingerprint of each language you need to classify and make guesses based on letter frequency. This probably goes a bit further than a rule-based system. Probably a good tool to build this would be a text classification algorithm, which will do all the analysis for you. You would train an algorithm on different languages, instead of having to articulate the actual rules yourself.
A much more sophisticated version of this is presumably what Google does.

language popularity figures (C++, C#, Java, PHP, flash script, etc.)

I need to find figures that show how many programmers worldwide have each of the following languages as their primary programming language.
C
C++
C#
Objective-C
Java
JavaScript
VB.NET
VB6 (or older)
VBA
PHP
flash scripts
Ruby
Does anyone know of such comparison figures?
If not, do you know of a good way to research this?
I could compare the number of tags here at stackoverflow and the number of articles for each language at sites like codeproject. This would give me a good idea.
But if you can suggest other ways to find these numbers, I will be grateful.
/Thomas
A very common site that does this is the TIOBE index. It basically searches for programming languages in major search engines and compares the results, and it shows you some history. The only problem is that C/C++/C# are not distinguished very well, therefore C is more dominant than you'd expect (not to mention that search results include many pages where many languages are listed, like programming FAQs). But in general, TIOBE gives a good idea, I think, and it should get better, since at least Google tends to know the difference between zero, two or four pluses.
Have you tried TIOBE index?
In general this is hard to measure because every approach has a lot of drawbacks.
TIOBE and other indices based on search results, for example, do not tell you anything about what is actually used, only about what is highly ranked by Google (you can even see that Google changing its results a bit in 2004/2005 completely scrambled TIOBE). Moreover, they have the problem that lots of search terms are ambiguous (Java is also an island, Ruby also exists as a gem, Python is a snake, and others have alternative meanings too). Another problem with search-based approaches is that most things put on the web stay up forever, regardless of whether they are CURRENTLY interesting: a C resource put up in 2002 is likely still available today, which hugely overrates leading or older languages.
Here is an interesting approach based on the number of book sales. (This at least eliminates the ambiguity problem, but it comes with others.)
Wikipedia also has a small article about the topic.
Try Google trends (see an example). In addition, check sites like freshmeat.net and note the number of projects in each language. That's only open source projects and many people will use a different language for their hobby projects than at work (i.e. one that sucks less).
Next, look for sites which offer job openings. I don't have a good link handy but this Google query should get your started.

(human) Language of a document

Is there a way (a program, a library) to approximately know which language a document is written in?
I have a bunch of text documents (~500K) in mixed languages to import into an i18n-enabled CMS (Drupal).
I don't need perfect matches, only some guess.
There is a pretty easy way to do this, given that you have corpus data in all the different languages you'll need to identify. It's called n-gram modeling. I think Lingua::Identify does this already, though, so that is your best bet rather than implementing your own.
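For reference, the core of an n-gram identifier is quite small. A bare-bones character-trigram sketch in Python (the profiles here are trained on single sentences purely for illustration; real ones need the corpus data mentioned above):

from collections import Counter

def ngrams(text, n=3):
    text = " " + text.lower() + " "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap(a, b):
    # Very crude similarity: shared trigram mass
    return sum(min(a[g], b[g]) for g in set(a) & set(b))

PROFILES = {
    "en": ngrams("the quick brown fox jumps over the lazy dog"),
    "fr": ngrams("le renard brun saute par-dessus le chien paresseux"),
}

def identify(text):
    grams = ngrams(text)
    return max(PROFILES, key=lambda lang: overlap(grams, PROFILES[lang]))

print(identify("le chien dort dans le jardin"))   # -> 'fr', hopefully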
I'd say your best bet is to look for key words - articles, that kind of thing - that are unique to the languages you're looking for. "Un" will show up in both Spanish and French, for example, but "une" is identifiably French whereas "unos", for example, is identifiably Spanish. Diacritics are useful too - you'll see "ñ" in Spanish and possibly Portuguese, "ç" in French and a few others... that kind of thing.
edit - Paul's solution is probably the best; looks like it uses methods like what I outlined, plus a few extra.
By running a Google search for "determine language of document" I found many different sites that will help you. The third link on the first page eventually led me to a function in the Google Code API that is exactly what you need.
The Google Translation API is cool and has a REST interface. But I would need to send it a LOT of BIG documents (yes, I could use an excerpt) and, even if Google is Google, I don't think this is fair.
The documents are also not mine, and I'd have to ask my client if it is OK to send them to a third party (even if, sooner or later, G will get them ;)).
I think I'll go through the Perl path...
There seems to be a Perl module for this: Lingua::Identify
Paul.

Detecting programming language from a snippet [closed]

What would be the best way to detect what programming language is used in a snippet of code?
I think that the method used in spam filters would work very well. You split the snippet into words. Then you compare the occurrences of these words with known snippets, and compute the probability that this snippet is written in language X for every language you're interested in.
http://en.wikipedia.org/wiki/Bayesian_spam_filtering
If you have the basic mechanism then it's very easy to add new languages: just train the detector with a few snippets in the new language (you could feed it an open source project). This way it learns that "System" is likely to appear in C# snippets and "puts" in Ruby snippets.
I've actually used this method to add language detection to code snippets for forum software. It worked 100% of the time, except in ambiguous cases:
print "Hello"
Let me find the code.
I couldn't find the code so I made a new one. It's a bit simplistic but it works for my tests. Currently if you feed it much more Python code than Ruby code it's likely to say that this code:
def foo
puts "hi"
end
is Python code (although it really is Ruby). This is because Python has a def keyword too. So if it has seen def 1000 times in Python and 100 times in Ruby, it may still say Python even though puts and end are Ruby-specific. You could fix this by keeping track of the number of words seen per language and dividing by that somewhere (or by feeding it equal amounts of code in each language).
class Classifier
  def initialize
    @data = {}
    @totals = Hash.new(1)
  end
  def words(code)
    code.split(/[^a-z]/).reject{|w| w.empty?}
  end
  def train(code, lang)
    @totals[lang] += 1
    @data[lang] ||= Hash.new(1)
    words(code).each {|w| @data[lang][w] += 1 }
  end
  def classify(code)
    ws = words(code)
    @data.keys.max_by do |lang|
      # We really want to multiply here but I use logs
      # to avoid floating point underflow
      # (adding logs is equivalent to multiplication)
      Math.log(@totals[lang]) +
        ws.map{|w| Math.log(@data[lang][w])}.reduce(:+)
    end
  end
end
# Example usage
c = Classifier.new
# Train from files
c.train(open("code.rb").read, :ruby)
c.train(open("code.py").read, :python)
c.train(open("code.cs").read, :csharp)
# Test it on another file
c.classify(open("code2.py").read) # => :python (hopefully)
Language detection solved by others:
Ohloh's approach: https://github.com/blackducksw/ohcount/
Github's approach: https://github.com/github/linguist
Guesslang is a possible solution:
http://guesslang.readthedocs.io/en/latest/index.html
There's also SourceClassifier:
https://github.com/chrislo/sourceclassifier/tree/master
I became interested in this problem after finding some code in a blog article which I couldn't identify. Adding this answer since this question was the first search hit for "identify programming language".
An alternative is to use highlight.js, which performs syntax highlighting but uses the success-rate of the highlighting process to identify the language. In principle, any syntax highlighter codebase could be used in the same way, but the nice thing about highlight.js is that language detection is considered a feature and is used for testing purposes.
UPDATE: I tried this and it didn't work that well. Compressed JavaScript completely confused it, i.e. the tokenizer is whitespace sensitive. Generally, just counting highlight hits does not seem very reliable. A stronger parser, or perhaps unmatched section counts, might work better.
First, I would try to find the specific keywords of a language, e.g.
"package, class, implements "=> JAVA
"<?php " => PHP
"include main fopen strcmp stdout "=>C
"cout"=> C++
etc...
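A quick Python sketch of that keyword-scoring idea (the keyword sets are just the examples above, far from exhaustive, and the substring matching is deliberately crude):

KEYWORDS = {
    "java": {"package", "class", "implements"},
    "php":  {"<?php"},
    "c":    {"#include", "fopen", "strcmp", "stdout"},
    "cpp":  {"cout"},
}

def guess_language(snippet):
    # Count how many of each language's tell-tale keywords occur
    scores = {lang: sum(kw in snippet for kw in kws)
              for lang, kws in KEYWORDS.items()}
    return max(scores, key=scores.get)

print(guess_language('#include <stdio.h>\nint main() { printf("hi"); }'))  # -> 'c'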
It's very hard and sometimes impossible. Which language is this short snippet from?
int i = 5;
int k = 0;
for (int j = 100 ; j > i ; i++) {
j = j + 1000 / i;
k = k + i * j;
}
(Hint: It could be any one out of several.)
You can try to analyze various languages and try to decide using frequency analysis of keywords. If certain sets of keywords occur with certain frequencies in a text, it's likely that the language is Java, etc. But I don't think you will get anything that is completely foolproof, as you could, for example, give a variable in C the same name as a keyword in Java, and the frequency analysis would be fooled.
If you take it up a notch in complexity you could look for structures, if a certain keyword always comes after another one, that will get you more clues. But it will also be much harder to design and implement.
It would depend on what type of snippet you have, but I would run it through a series of tokenizers and see which language's BNF it comes up as valid against.
I needed this, so I created my own.
https://github.com/bertyhell/CodeClassifier
It's very easily extendable by adding a training file in the correct folder.
It's written in C#, but I imagine the code is easily converted to any other language.
The best solution I have come across is using the linguist gem in a Ruby on Rails app. It's kind of a specific way to do it, but it works. This was mentioned above by @nisc, but I will tell you my exact steps for using it. (Some of the following command-line commands are specific to Ubuntu but should be easily translated to other OSes.)
If you have any rails app that you don't mind temporarily messing with, create a new file in it to insert your code snippet in question. (If you don't have rails installed there's a good guide here although for ubuntu I recommend this. Then run rails new <name-your-app-dir> and cd into that directory. Everything you need to run a rails app is already there).
After you have a rails app to use this with, add gem 'github-linguist' to your Gemfile (literally just called Gemfile in your app directory, no ext).
Then install ruby-dev (sudo apt-get install ruby-dev)
Then install cmake (sudo apt-get install cmake)
Now you can run gem install github-linguist (if you get an error that says icu required, do sudo apt-get install libicu-dev and try again)
(You may need to do a sudo apt-get update or sudo apt-get install make or sudo apt-get install build-essential if the above did not work)
Now everything is set up. You can use this any time you want to check code snippets. In a text editor, open the file you've made to insert your code snippet (let's just say it's app/test.tpl, but if you know the extension of your snippet, use that instead of .tpl; if you don't know the extension, don't use one). Now paste your code snippet into this file. Go to the command line and run bundle install (you must be in your application's directory). Then run linguist app/test.tpl (more generally, linguist <path-to-code-snippet-file>). It will tell you the type, MIME type, and language. For multiple files (or for general use with a Ruby/Rails app) you can run bundle exec linguist --breakdown in your application's directory.
It seems like a lot of extra work, especially if you don't already have rails, but you don't actually need to know ANYTHING about rails if you follow these steps and I just really haven't found a better way to detect the language of a file/code snippet.
This site seems to be pretty good at identifying languages, if you want a quick way to paste a snippet into a web form, rather than doing it programmatically: http://dpaste.com/
Nice puzzle.
I think it is impossible to detect all languages, but you could trigger on key tokens (certain reserved words and often-used character combinations).
But there are a lot of languages with similar syntax, so it also depends on the size of the snippet.
Prettify is a Javascript package that does an okay job of detecting programming languages:
http://code.google.com/p/google-code-prettify/
It is mainly a syntax highlighter, but there is probably a way to extract the detection part for the purposes of detecting the language from a snippet.
I wouldn't think there would be an easy way of accomplishing this. I would probably generate lists of symbols/common keywords unique to certain languages/classes of languages (e.g. curly brackets for C-style language, the Dim and Sub keywords for BASIC languages, the def keyword for Python, the let keyword for functional languages). You then might be able to use basic syntax features to narrow it down even further.
I think the biggest distinction between languages is its structure. So my idea would be to look at certain common elements across all languages and see how they differ. For example, you could use regexes to pick out things such as:
function definitions
variable declarations
class declarations
comments
for loops
while loops
print statements
And maybe a few other things that most languages should have. Then use a point system. Award at most 1 point for each element if the regex is found. Obviously, some languages will use the exact same syntax (for loops are often written like for(int i=0; i<x; ++i), so multiple languages could each score a point for the same thing, but at least you're reducing the likelihood of it being an entirely different language). Some of them might score 0 across the board (the snippet doesn't contain a function at all, for example), but that's perfectly fine.
Combine this with Jules' solution, and it should work pretty well. Maybe also look for frequencies of keywords for an extra point.
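A minimal Python sketch of that point system (the patterns are illustrative and nowhere near complete):

import re

# One regex per structural element, per language.
PATTERNS = {
    "python": [r"\bdef \w+\(.*\):", r"\bprint\(", r"\bfor \w+ in "],
    "java":   [r"\b(public|private)\s+\w+\s+\w+\(", r"\bclass \w+\s*\{", r"//"],
    "ruby":   [r"\bdef \w+", r"\bend\b", r"\bputs\b"],
}

def score(snippet):
    # Award one point per matching structural pattern
    return {lang: sum(bool(re.search(p, snippet)) for p in pats)
            for lang, pats in PATTERNS.items()}

snippet = "def add(a, b):\n    return a + b\nprint(add(1, 2))"
print(score(snippet))   # python should score highest here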
Interesting. I have a similar task to recognize text in different formats. YAML, JSON, XML, or Java properties? Even with syntax errors, for example, I should tell apart JSON from XML with confidence.
I figure how we model the problem is critical. As Mark said, single-word tokenization is necessary but likely not enough. We will need bigrams, or even trigrams. But I think we can go further from there, knowing that we are looking at programming languages. I notice that almost any programming language has two distinctive types of tokens: symbols and keywords. Symbols are relatively easy to recognize (though some might be literals rather than part of the language). Then bigrams or trigrams of symbols will pick up unique syntax structures around symbols. Keywords are another easy target if the training set is big and diverse enough. A useful feature could be bigrams around possible keywords. Another interesting type of token is whitespace. Actually, if we tokenize in the usual way by whitespace, we will lose this information. I'd say that, for analyzing programming languages, we should keep the whitespace tokens, as they may carry useful information about the syntax structure.
Finally, if I choose a classifier like a random forest, I will crawl GitHub and gather all the public source code. Most source code files can be labeled by file suffix. For each file, I will randomly split it at empty lines into snippets of various sizes. I will then extract the features and train the classifier using the labeled snippets. After training is done, the classifier can be tested for precision and recall.
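A rough Python sketch of that data-preparation step (the extension map and directory layout are assumptions made for the example; feature extraction and the actual classifier are left out):

from pathlib import Path

# Assumed mapping from file suffix to label
EXTENSION_TO_LANG = {".py": "python", ".rb": "ruby", ".java": "java"}

def snippets(path, min_lines=3):
    # Split a source file at blank lines into snippet-sized blocks
    blocks, current = [], []
    for line in Path(path).read_text(errors="ignore").splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return [b for b in blocks if len(b.splitlines()) >= min_lines]

def labelled_snippets(root):
    for path in Path(root).rglob("*"):
        lang = EXTENSION_TO_LANG.get(path.suffix)
        if lang and path.is_file():
            for snip in snippets(path):
                yield snip, lang   # (feature extraction would follow)

# Example: list(labelled_snippets("crawled_repos"))[:5]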
I believe that there is no single solution that could possibly identify what language a snippet is in based on that single snippet alone. Take the keyword print: it could appear in any number of languages, each of which uses it for different purposes and with different syntax.
I do have some advice. I'm currently writing a small piece of code for my website that can be used to identify programming languages. As most of the other posts point out, there is a huge range of programming languages that you simply haven't heard of; you can't account for them all.
What I have done is that each language can be identified by a selection of keywords. For example, Python could be identified in a number of ways. It's probably easier if you pick 'traits' that are almost certainly unique to the language. For Python, I chose the trait of using colons to start a set of statements, which I believe is a fairly unique trait (correct me if I'm wrong).
If, in my example, you can't find a colon to start a statement set, then move on to another possible trait, let's say using the def keyword to define a function. Now this can cause some problems, because Ruby also uses the keyword def to define a function. The key to telling the two (Python and Ruby) apart is to use various levels of filtering to get the best match. Ruby uses the keyword end to finish a function, whereas Python doesn't have anything to finish a function, just a de-indent, but you don't want to go there. But again, end could also be Lua, yet another programming language to add to the mix.
You can see that programming languages simply overlap too much. A keyword in one language may happen to be a keyword in another. Using a combination of keywords that often go together, like Java's public static void main(String[] args), helps to eliminate those problems.
Like I've already said, your best chance is looking for relatively unique keywords or sets of keywords to separate one from the other. And, if you get it wrong, at least you had a go.
