Language detection for very short text [closed] - nlp

I'm creating an application for detecting the language of short texts, averaging fewer than 100 characters and containing slang (e.g. tweets, user queries, SMS).
All the libraries I tested work well for normal web pages, but not for very short text. The library giving the best results so far is Chrome's Compact Language Detector (CLD), which I had to build as a shared library.
CLD fails when the text is made up of very short words. Looking at CLD's source code, I see that it uses 4-grams, so that could be the reason.
The approach I'm thinking of right now to improve the accuracy (sketched in code below) is:
Remove brand names, numbers, URLs, and words like "software", "download", "internet".
Use a dictionary when the text contains more short words than some threshold, or when it contains too few words.
The dictionary is created from Wikipedia news articles + hunspell dictionaries.
What dataset is most suitable for this task? And how can I improve this approach?
So far I'm using EUROPARL and Wikipedia articles. I'm using NLTK for most of the work.
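Here's a rough sketch of the pipeline I have in mind (Python; the word lists and threshold are placeholders, and cld_detect stands in for whatever library call you use):

    import re

    STOPWORDS = {"software", "download", "internet"}   # placeholder list
    SHORT_WORD_RATIO = 0.5                             # placeholder; tune on held-out data

    # Tiny per-language word lists; in practice built from Wikipedia
    # articles + hunspell dictionaries.
    DICTS = {
        "en": {"the", "you", "and", "what", "this"},
        "es": {"que", "los", "una", "por", "como"},
    }

    def preprocess(text):
        text = re.sub(r"https?://\S+", " ", text)   # strip URLs
        text = re.sub(r"\d+", " ", text)            # strip numbers
        return [w for w in re.findall(r"[^\W\d_]+", text.lower())
                if w not in STOPWORDS]

    def detect(text, cld_detect):
        """cld_detect is a stand-in for the n-gram detector being wrapped."""
        words = preprocess(text)
        short = sum(1 for w in words if len(w) <= 3)
        if not words or short / len(words) > SHORT_WORD_RATIO:
            # n-grams are unreliable here; fall back to a dictionary vote
            votes = {lang: sum(w in vocab for w in words)
                     for lang, vocab in DICTS.items()}
            return max(votes, key=votes.get)
        return cld_detect(" ".join(words))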

Language detection for very short texts is the topic of current research, so no conclusive answer can be given. An algorithm for Twitter data can be found in Carter, Tsagkias and Weerkamp 2011. See also the references there.

Yes, this is a topic of research and there is some progress that has been made.
For example, the author of "language-detection" at http://code.google.com/p/language-detection/ has created new profiles for short messages. Currently, it supports 17 languages.
I have compared it with Bing Language Detector on a collection of about 500 tweets which are mostly in English and Spanish. The accuracy is as follows:
Bing = 71.97%
Language-Detection Tool with new profiles = 89.75%
For more information, you can check his blog out:
http://shuyo.wordpress.com/2011/11/28/language-detection-supported-17-language-profiles-for-short-messages/

Also omit scientific names, names of medicines, etc. Your approach seems quite fine to me. I think Wikipedia is the best option for creating a dictionary, as it contains standard language. If you aren't pressed for time, you can also use newspapers.

Related

Checking English Grammar with NLTK [closed]

I'm starting to use the NLTK library, and I want to check whether a sentence in English is correct or not.
Example:
"He see Bob" - not correct
"He sees Bob" - correct
I read this, but it's quite hard for me.
I need an easier example.
Grammar checking is an active area of NLP research, so there isn't a 100% answer (maybe not even an 80% answer) at this time. The simplest approach (or at least a reasonable baseline) would be an n-gram language model: normalize the LM probabilities for utterance length and set a heuristic threshold between 'grammatical' and 'ungrammatical'.
You could use Google's n-gram corpus, or train your own on in-domain data. You might be able to do that with NLTK; you definitely could with LingPipe, the SRI Language Modeling Toolkit, or OpenGRM.
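For a concrete starting point, here's a minimal sketch using NLTK's nltk.lm module (NLTK >= 3.4); the corpus slice and the threshold are arbitrary placeholders you would tune on held-out data:

    # Requires: nltk.download('brown')
    from nltk.corpus import brown
    from nltk.lm import Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline

    N = 2  # bigram model
    train, vocab = padded_everygram_pipeline(N, brown.sents()[:10000])
    lm = Laplace(N)  # add-one smoothing, so unseen bigrams stay finite
    lm.fit(train, vocab)

    def normalized_logprob(tokens):
        """Average per-token log2 probability, normalizing for length."""
        padded = ["<s>"] + tokens + ["</s>"]
        scores = [lm.logscore(w, [ctx]) for ctx, w in zip(padded, padded[1:])]
        return sum(scores) / len(scores)

    THRESHOLD = -12.0  # heuristic cutoff for 'grammatical'; tune this
    for sent in (["He", "sees", "Bob"], ["He", "see", "Bob"]):
        score = normalized_logprob(sent)
        print(sent, round(score, 2), "grammatical?", score > THRESHOLD)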
That said, an n-gram model won't perform all that well. If it meets your needs, great, but if you want to do better, you'll have to train a machine-learning classifier. A grammaticality classifier would generally use features from syntactic and/or semantic processing (e.g. POS tags, dependency and constituency parses). You might look at some of the work from Joel Tetreault and the team he worked with at ETS, or Jennifer Foster and her team at Dublin City University.
Sorry there isn't an easy and straightforward answer...

Beginner's guide to ElasticSearch [closed]

There haven't been any books about ElasticSearch (that I know of), and http://www.elasticsearch.org/guide/ seems to contain only reference material.
Any good beginner's guide or tutorials, perhaps by examples, to recommend, especially in terms of the different mapping and indexing strategies?
Edit (April 2015):
As many have noticed, my old blog is now defunct. Most of my articles were transferred over to the Elastic blog, and can be found by filtering on my name: https://www.elastic.co/blog/author/zachary-tong
To be perfectly honest, the best source of beginner knowledge is now Elasticsearch: The Definitive Guide, written by myself and Clinton Gormley.
It assumes zero search-engine knowledge and explains information-retrieval first principles in the context of Elasticsearch. While the reference docs are all about finding the precise parameter you need, the Guide is a narrative that discusses problems in search and how to solve them.
Best of all, the book is OSS and free (unless you want to buy a paper copy, in which case O'Reilly will happily sell you one :) )
Edit (August 2013):
Many of my articles have been migrated over to the official Elasticsearch blog, along with new articles that have not been published on my personal site.
Original post:
I've also been frustrated with learning ElasticSearch, having no Lucene/Solr experience. I've been slowly documenting the things I've learned on my blog, and have four tutorials written so far:
So I don't have to keep editing, all future tutorials on my blog can be found under this category link.
And these are some links that I have bookmarked, because they have been incredibly helpful in one way or another:
Thinking through and debugging problems with your query
Another example of complicated mapping (ngram, synonyms, phonemes)
Searching parts of a word
Fun with ElasticSearch's children and nested documents
You can get an overview from this link:
http://spinscale.github.com/elasticsearch/2012-03-jugm.html#/1
I found ElasticSearch one of the hardest things I've had to learn. I hadn't used Lucene before, and I found the documentation quite hard to follow.
These are the things that I wish I'd known before I started learning it:
Configuration and setup
I configured ELS to run on three VMs using CentOS, Mint, and Ubuntu. CentOS was by far the best choice of the three.
I followed this guide to help me set it up (it worked fine on all three distros)
Index and types
One index can contain many types, and it's by using types that you can achieve a good degree of separation between data that belongs within the same index.
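As a minimal sketch of that idea (Python elasticsearch client against a pre-7.x cluster, since mapping types were removed in later versions; the index and type names are made up):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    # Two document types living in the same index, sharing one shard
    # layout but keeping their data logically separate.
    es.index(index="app", doc_type="user", id=1,
             body={"name": "alice", "joined": "2013-01-01"})
    es.index(index="app", doc_type="order", id=1,
             body={"user": "alice", "total": 42.0})

    # Searches can be restricted to a single type within the index.
    hits = es.search(index="app", doc_type="order",
                     body={"query": {"match": {"user": "alice"}}})
    print(hits["hits"]["total"])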
PHP
I use PHP as a front end and used this wrapper to integrate my ELS installation into my scripts.
Other resources
The presentation in the other answer to your question is really good. Go through it and learn the query DSL syntax; once you're set up, this is where the real power of ELS comes into its own.
If you're new to elasticsearch and to "information retrieval" / "full-text search" in general, my advice would be to check these resources first, before trying out tutorials on specific features:
The Your Data, Your Search, ElasticSearch presentation from EURUKO 2011
The ElasticSearch - A Distributed Search Engine talk by Shay Banon, together with the accompanying scripts
The Lucene in Action book (at least the general chapters on the indexing, analysis, tokenization, and constructing queries)

Best turnkey relation detection library? [closed]

What is the best turnkey (ready to use, industrial-strength) relation detection library?
I have been playing around with NLTK and the results I get are not very satisfactory.
http://nltk.googlecode.com/svn/trunk/doc/book/ch07.html
http://nltk.googlecode.com/svn/trunk/doc/howto/relextract.html
Ideally, I would like a library that can take sentences like:
"Sarah killed a wolf that was eating a child"
and turn it into a data structure that means something like:
killed(Sarah, wolf) AND eating(wolf,child)
I know that this is the subject of a large body of research and that it is not an easy task. That said, is anyone aware of a reasonably robust ready-to-use library for detecting relations?
Update: Extractiv is no longer available.
Extractiv's On-Demand REST service:
http://rest.extractiv.com/extractiv/?url=https://stackoverflow.com/questions/4732686/best-turnkey-relation-detection-library&output_format=html_viewer will process this page, extract and display the two semantic triples you desire in the bottom left corner under "GENERIC". (It throws away some of the text from the page in the html viewer, but this text is not thrown away if you utilize json or rdf output).
This assumes you're open to a commercial, industrial-strength solution, though limited free usage is allowed. It's a web service, but open source libraries can be used to access it, or it can be purchased from Language Computer Corporation.
These relations can be read fairly easily out of the output of dependency parsers. For instance, if you put your example into the Stanford Parser online, you can see both subject-verb-object triples in the collapsed typed-dependencies representation:
nsubj(killed-2, Sarah-1)
dobj(killed-2, wolf-4)
nsubj(eating-7, wolf-4)
dobj(eating-7, child-9)
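As a toy illustration, those typed-dependency lines can be collapsed into predicate(subject, object) triples with a few lines of Python (the regex assumes the exact token-index format shown above):

    import re
    from collections import defaultdict

    deps = """nsubj(killed-2, Sarah-1)
    dobj(killed-2, wolf-4)
    nsubj(eating-7, wolf-4)
    dobj(eating-7, child-9)"""

    # Group dependents by governor token, keyed on relation type.
    args = defaultdict(dict)
    for rel, gov, dep in re.findall(r"(\w+)\(([^,]+)-\d+, ([^)]+)-\d+\)", deps):
        args[gov][rel] = dep

    # A governor with both an nsubj and a dobj yields a triple.
    for gov, rels in args.items():
        if "nsubj" in rels and "dobj" in rels:
            print(f"{gov}({rels['nsubj']}, {rels['dobj']})")
    # -> killed(Sarah, wolf)
    #    eating(wolf, child)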

Learning a new language project [closed]

Does anyone have a standard project that they use when learning a new language? Kind of like a specification document for a project that covers all aspects of programming. I guess it also depends on the type of language and what it's capable of.
Contributing something to an open source project seems to work for me. In addition to getting exposed to some coding habits in the language, you get to work on something useful.
Going through the first few problems of Project Euler is a very good way to get a handle on topics like I/O, recursion, iteration, and basic data structures. I'd highly recommend it.
A friend of mine had a coworker who coded a Minesweeper clone every time he wanted to learn a new language with a GUI.
I like making simple websites for learning.
Pro: you can put it online and show it to people.
Con: the language has to be suitable for web development.
Writing a simple ray tracer covers:
math functions (pow, sqrt, your own intersection routines; a sketch follows below)
recursion (because it is a Whitted-style recursive ray tracer)
iteration (over all pixels)
how to write custom types (rays, possibly vectors)
pixel-wise graphics
something to play with in the compiler's (optimization) flags
optional:
a simple GUI
file reading/writing
I've also done this with metatrace.
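For instance, here is a minimal ray-sphere intersection routine in Python, the kind of code the first bullet refers to (vector math kept to plain tuples for brevity):

    import math

    def ray_sphere(origin, direction, center, radius):
        """Return distance t to the nearest hit, or None if the ray misses.

        Solves |o + t*d - c|^2 = r^2, a quadratic in t. Assumes
        `direction` is normalized.
        """
        oc = tuple(o - c for o, c in zip(origin, center))
        b = 2.0 * sum(d * e for d, e in zip(direction, oc))
        c = sum(e * e for e in oc) - radius * radius
        disc = b * b - 4.0 * c  # a == 1 for a unit direction
        if disc < 0.0:
            return None  # ray misses the sphere
        t = (-b - math.sqrt(disc)) / 2.0
        return t if t > 0.0 else None

    # A ray from the origin straight down -z hits a unit sphere 5 away:
    print(ray_sphere((0, 0, 0), (0, 0, -1), (0, 0, -5), 1.0))  # -> 4.0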

Are units of measurement unique to F#? [closed]

I was reading Andrew Kennedy's blog post series on units of measurement in F# and it makes a lot of sense in a lot of cases. Are there any other languages that have such a system?
Edit: To be more clear, I mean the flexible units of measurement system where you can define your own arbitrarily.
Does TI-89 BASIC count? Enter 54_kg * (_c^2) and it will give you an answer in joules.
Other than that, I can't recall any languages that have it built in, but any language with decent OO should make it simple to roll your own. Which means someone else probably already did.
Google confirms it. For example, here's one in Python; its __repr__ could easily be amended to select the most appropriate derived unit, etc.
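To sketch the "roll your own" idea (runtime-checked only, unlike F#'s compile-time units; all names here are illustrative, not from any library):

    # A value tagged with a dimension vector (mass, length, time).
    # Multiplying adds exponents; adding requires matching dimensions.
    class Quantity:
        def __init__(self, value, dims):
            self.value, self.dims = value, dims

        def __mul__(self, other):
            return Quantity(self.value * other.value,
                            tuple(a + b for a, b in zip(self.dims, other.dims)))

        def __add__(self, other):
            if self.dims != other.dims:
                raise TypeError("dimension mismatch")  # e.g. kg + m
            return Quantity(self.value + other.value, self.dims)

        def __repr__(self):
            units = zip(("kg", "m", "s"), self.dims)
            body = "*".join(f"{u}^{e}" for u, e in units if e)
            return f"{self.value} {body or '(dimensionless)'}"

    mass = Quantity(54, (1, 0, 0))    # 54 kg
    accel = Quantity(10, (0, 1, -2))  # 10 m/s^2
    print(mass * accel)               # -> 540 kg^1*m^1*s^-2, i.e. newtons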
CPAN has several modules for Perl: Physics::Unit, Data::Dimensions, Class::Measure, Math::Units::PhysicalValue, and a handful of others that will convert but don't really combine values with units.
Nemerle had compiler-checked Units of Measure in 2006.
http://nemerle.org
http://nemerle.org/forum.old/viewtopic.php?t=265&view=previous&sid=00f48f33fafd3d49cc6a92350b77d554
C++ has it, in the form of boost::units.
I'm not sure if this really counts, but the RPL system on my HP-48 calculator does have similar features. I can write 40_gal 5_l + and get the right answer of 156.416 liters.
I believe I saw that Fortress supports this; I'll see if I can find a link.
I can't find a specific link, but the language specification mentions it in a couple of places. The 1.0 specification also says that dimensions and units were temporarily dropped (along with a whole heap of other features) to match the current implementation. It's a work in progress, so I guess things are in flux.
F# is the first mainstream language to support this feature.
There is also a Java specification for units at http://jcp.org/en/jsr/detail?id=275, and an implementation you can already use at http://jscience.org/
Nemerle has something much better than F#!
You should check this one: http://rsdn.ru/forum/src/1823225.flat.aspx#1823225
It is really great.
You can download it here: http://rsdn.ru/File/27948/Oyster.Units.0.06.zip
Some examples:
    def m3 = 1 g;
    def m4 = Si.Mass(m1);  // m1 is defined earlier in the original sample (not shown here)
    WriteLine($"Mass in SI: $m4, in CGS: $m3");
    def x1 = Si.Area(1 cm * 10 m);
    WriteLine($"Area of 1 cm * 10 m = $x1 m");
I'm pretty sure Ada has it.
Well, I made the QuantitySystem library especially for units in C#, though it's not compile-time checked.
I've tried to make it behave the way I wanted at runtime, and it supports extension, so you can define your own units.
http://QuantitySystem.CodePlex.com
It can also differentiate between torque and work :) [this was important for me]
The library's approach goes from dimensions to units; everything else I've seen so far takes a units-only approach.
I'm sure you'd be able to do this in most dynamic languages (JavaScript, Python, Ruby) by carefully monkey-patching some of the base classes. You might run into problems, though, when working with imperial measurements.
