Difference between dependencies (basic and enhanced) from Stanford CoreNLP?

What is the difference between basic-dependencies, collapsed-dependencies, and collapsed-ccprocessed-dependencies in Stanford CoreNLP, and how can I use them to understand a query?

A good way to see the difference by example is with the online demo (corenlp.run). basic, collapsed, and cc-processed are roughly the old ("Stanford Dependencies") equivalents to basic, enhanced, and enhanced++ in the newer ("Universal Dependencies") representation.
At a high level, the basic dependencies are meant to be easier to parse -- e.g., they're always a tree, the label set is small, etc. The enhanced[++] dependencies (like their predecessors, "collapsed" and "cc-processed") are deterministic transformations on the basic dependencies that are intended to make them a bit easier to work with, and a bit more semantic. For example, they put the preposition in the arc label (prep:of in Stanford Dependencies; nmod:of in Universal Dependencies).
The full documentation of the differences (for Universal Dependencies) can be found in: Schuster and Manning (2016). "Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks". The original Stanford Dependencies are perhaps best documented in the Stanford Dependencies Manual.
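To make the "preposition on the arc" idea concrete, here is a minimal sketch in plain Python. It is not CoreNLP's implementation: the function name `enhance` and the hand-written edge triples are assumptions for illustration, and real enhanced dependencies cover many more cases (conjunctions, relative clauses, and so on).

```python
# Toy illustration only: real enhanced dependencies are produced by CoreNLP
# and handle far more constructions than this single rule.

def enhance(basic_edges):
    """Append the preposition to nmod labels, e.g. nmod -> nmod:of.

    basic_edges is a list of (governor, label, dependent) triples in a
    Universal Dependencies-like style, where a preposition attaches to
    its noun with the label "case".
    """
    # Map each noun to the preposition that marks it.
    case_of = {gov: dep.lower() for gov, label, dep in basic_edges if label == "case"}

    enhanced = []
    for gov, label, dep in basic_edges:
        if label == "nmod" and dep in case_of:
            label = f"nmod:{case_of[dep]}"
        enhanced.append((gov, label, dep))
    return enhanced


# "the capital of France": hypothetical basic edges, words stand in for token ids.
basic = [
    ("capital", "det", "the"),
    ("capital", "nmod", "France"),
    ("France", "case", "of"),
]
enhanced = enhance(basic)
for edge in enhanced:
    print(edge)
```

With the enhanced label, a query such as "what is X the capital of?" can be answered by looking for a single `nmod:of` arc instead of following two arcs.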

Related

Natural Language Processing Libraries

I'm having a hard time figuring out what library and datasets go together.
Toolkits / Libraries I've found:
CoreNLP - Java
NLTK - Python
OpenNLP - Java
ClearNLP - Java
Out of all of these, some are missing features. For example, OpenNLP doesn't have dependency parsing.
I need to find a library that's quick that will also do dependency parsing and part of speech tagging.
The next hurdle is where do we get data sets. I've found a lot of things out there, but nothing full and comprehensive.
Data I've found:
NLTK Corpora
English Web Treebank (looks to be the best but is paid)
OpenNLP
Penn Treebank
I'm confused as to what data sets I need for what features and what's actually available publicly. From my research it seems ClearNLP will work best, but it has very little data.
Thank you
Stanford CoreNLP provides both POS tagging and dependency parsing out of the box (plus many other features!). It already ships with trained models, so you don't need any data sets for it to work!
Please let me know if you have any more questions about the toolkit!
http://nlp.stanford.edu/software/corenlp.shtml

Detecting language using Stanford NLP

I'm wondering if it is possible to use Stanford CoreNLP to detect which language a sentence is written in? If so, how precise can those algorithms be?
Stanford CoreNLP almost certainly has no language identification at the moment. "Almost", because nonexistence is much harder to prove.
EDIT: Nevertheless, here is some circumstantial evidence:
There is no mention of language identification on the main page, on the CoreNLP page, in the FAQ (although it has a question "How do I run CoreNLP on other languages?"), or in the 2014 paper by CoreNLP's authors.
Tools that combine several NLP libraries, including Stanford CoreNLP, use another library for language identification, for example DKPro Core ASL; other users discussing language identification and CoreNLP don't mention this capability either.
The CoreNLP source contains Language classes, but nothing related to language identification; you can manually check all 84 occurrences of the word "language" here.
Try TIKA, or TextCat, or Language Detection Library for Java (they report "99% over precision for 53 languages").
In general, quality depends on the size of input text: if it is long enough (say, at least several words and not specially chosen), then precision can be pretty good - about 95%.
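As a rough illustration of why input length matters, here is a toy character-trigram identifier in plain Python. It is a sketch in the spirit of n-gram based approaches, not the algorithm of any tool named above; the training sentences, the `identify` function, and the overlap score are all assumptions made up for this example.

```python
from collections import Counter


def trigrams(text):
    """Count character trigrams in lower-cased text padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# Tiny made-up training samples; real tools train on far more text.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs to the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft dann zum haus",
    "fr": "le renard brun rapide saute par-dessus le chien paresseux et court vers la maison",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def identify(text):
    """Return the language whose profile shares the most trigram counts with text."""
    grams = trigrams(text)

    def overlap(profile):
        # Counter returns 0 for trigrams it has never seen.
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))


print(identify("the dog runs to the house"))
print(identify("der hund läuft zum haus"))
print(identify("le chien court vers la maison"))
```

A one-word input yields only a handful of trigrams, so the overlap scores of different languages end up close together; longer inputs separate them more clearly.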
Stanford CoreNLP doesn't have language ID (at least not yet), see http://nlp.stanford.edu/software/corenlp.shtml
There are plenty of other language detection/identification tools. But do take the reported precision with a pinch of salt. It is usually evaluated narrowly, bounded by:
a fixed list of languages,
test sentences of substantial length,
test sentences that are each in a single language, and
a skewed proportion of training to testing instances.
Notable language ID tools include:
TextCat (http://cran.r-project.org/web/packages/textcat/index.html)
CLD2 (https://code.google.com/p/cld2/)
LingPipe (http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html)
LangID (https://github.com/saffsd/langid.py)
CLD3 (https://github.com/google/cld3)
An exhaustive list from meta-guide.com, see http://meta-guide.com/software-meta-guide/100-best-github-language-identification/
Noteworthy language identification shared tasks (with training/testing data) include:
Native Language ID (NLI 2013)
Discriminating Similar Languages (DSL 2014)
TweetID (2015)
Also take a look at:
Language Identification: The Long and the Short of the Matter
The Problems of Language Identification within Hugely Multilingual Data Sets
Selecting and Weighting N-Grams to Identify 1100 Languages
Indigenous Tweets
Microblog Language Identification: Overcoming the Limitations of Short, Unedited and Idiomatic Text

What's the difference between Stanford Tagger, Parser and CoreNLP?

I'm currently using different tools from Stanford NLP Group and trying to understand the differences between them. It seems to me that somehow they intersect each other, since I can use same features in different tools (e.g. tokenize, and POS-Tag a sentence can be done by Stanford POS-Tagger, Parser and CoreNLP).
I'd like to know what's the actual difference between each tool and in which situations I should use each of them.
All Java classes from the same release are the same, and, yes, they overlap. On a code basis, the parser and tagger are basically subsets of what is available in CoreNLP, except that they do have a couple of little add-ons of their own, such as the GUI for the parser. In terms of provided models, the parser and tagger come with models for a range of languages, whereas CoreNLP ships only with English out of the box. However, you can then download language-particular jars for CoreNLP which provide all the models we have for different languages. Anything that is available in any of the releases is present in the CoreNLP github site: https://github.com/stanfordnlp/CoreNLP

Pros/cons of different language workbench tools such as Xtext and MPS?

Does anyone have experience working with language workbench tools such as Xtext, Spoofax, and JetBrains' MPS? I'm looking to try one out and am having a hard time finding a good comparison of the different tools. What are the pros and cons of each?
I'm looking to build DSLs that generate Python code, so I'm especially interested to hear from people who've used one of these tools with Python (all three seem pretty Java-focused... why is that?). The DSLs are primarily for my own use, so I care less about building a really pretty IDE than I do about it being KISS to define the syntax and write the code generator. The ability to type-check / do static analysis of the DSLs would be pretty cool too.
I'm a little afraid of getting far down a path, hitting a wall, and realizing that all my code is in a format that can't be ported to anything else -- is that a risk with these tools? MPS in particular seems a little scary since as I understand it you don't really generate text-based syntaxes but rather build specialized editors for ASTs.
Markus Voelter does a pretty good job comparing those three in se-radio and Software ArchitekTOUR podcasts.
The basic idea is that Xtext is the most widely used, and therefore the most stable and best documented. It is based on the popular Eclipse platform and the modeling ecosystem (EMF) that surrounds it. On the other hand, it is parser based and uses ANTLR internally, which means the kinds of grammars you can define are limited and languages cannot be combined easily.
Spoofax is an academic product with least adoption of those three. It is also parser based, but uses its own parser generator internally which allows language combinations.
JetBrains MPS is projection based, which gives much freedom to the language designer and allows combinations of languages. It also has solid support. A drawback might be the learning curve.
None of these tools is strictly Java focused as target language for code generators. Xtext uses Xpand templates, which are plain text. I don't really know how code generation in Spoofax works. MPS has its base language, which is said to be subset of Java, but there are different alternatives.
I personally use Xtext because of its simplicity and maturity, but those strong limitations given by its design make it not a very future proof choice.
I have chosen XText in the same case two weeks ago, but I don't know anything about Spoofax.
My first impression - Xtext is very simple and productive.
I made my first real-life (but very simple) project in 30 minutes: I generated a graphviz dot graph and an HTML report.
I don't like MPS because I prefer plain text source and destination files.
There are other systems for doing this kind of thing. If your goal is building tools, you don't necessarily have to look to an IDE with an integrated tool; sometimes you can find better tools that have focused on utility rather than IDE integration.
Consider any of the pure program transformation tools:
TXL (practical, single paradigm)
Stratego (Spoofax before it was transplanted into Eclipse)
Rascal (research, very nicely designed in many ways)
DMS Software Reengineering Toolkit (happens to be mine; commercial; used to do heavy duty DSL/conventional language analysis and transformation including on C++)
These all provide good mechanisms for defining DSLs and transforming them.
What really matters is the support machinery for carrying out "life after parsing".
I've experimented for a couple of days with Xtext, and while the tool looks promising I was eventually put off by the tight integration with the Eclipse ecosystem and the pain one has to go through just to get what should work hassle-free out of the box: a headless run of the code generator you implemented. See here for some of the minutiae one has to go through (and it's not even properly documented on the Xtext web site but rather on a blog, meaning it's an ad-hoc patch that could very well break on the next release).
Will take another look in half a year to see if there has been any improvement on this front.
Take a look at the Markus Völter's book. It does a very comprehensive comparison of these 3 technologies.
http://dslbook.org
Xtext is very well maintained, but this doesn't mean it's problem-free. Getting the type system, scoping and generation running isn't as easy as advertised.
Spoofax is scannerless, (simplifying grammar composition). Not that well documented, but seems complete.
MPS is projectional. That is a pro for language composition and a con for editing. It supports multiple editors for an AST and will soon even support a nice diagram editor. Base language documentation isn't that good. Type system, scoping and checking are very well handled. Model to model transformations are done by the solver. My colleagues using it complain about the model to text languages. (In my opinion, M2M wasn't that intuitive either.)
Years ago Microsoft had the OSLO project. MGrammar and especially Quadrant were very promising. It was possible to represent your model in table, form, text or diagram view. But suddenly they've cancelled the project (and perhaps shot the people working on it)
Perhaps today the best place to compare different language workbenches is http://www.languageworkbenches.net/ and there http://www.languageworkbenches.net/past-editions/ shows how a set of Language Workbenches implement a similar kind of task: a dsl for a particular domain.
Update 2022: as links were broken and newer articles on the topic are written see the site referred above at:
https://web.archive.org/web/20160324201529/http://www.languageworkbenches.net/
References to article reviewing language workbenches include: 1) State of the art: https://link.springer.com/chapter/10.1007/978-3-319-02654-1_11 and 2) Empirical evaluation: https://hal.archives-ouvertes.fr/file/index/docid/706841/filename/Evaluation_of_Modeling_Tools_Adaptation.pdf

Natural language processing [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
This question is maybe (about 100%) subjective, but I need advice. What is the best language for natural language processing? I know Java and C++, but is there an easier way to do it? To be more specific, I need to process texts from a lot of sites and extract information.
As I said in the comments, the question is not about a language but about a suitable library, and there are a lot of NLP libraries in both Java and C++. I believe you should inspect some of them (in both languages) and then, once you know what is available, make some kind of "big plan" for how to implement your task. So here I'll just give you some links with a brief explanation of what is what.
Java
GATE - it is exactly what its name means: General Architecture for Text Engineering. An application in GATE is a pipeline. You put language processing resources like tokenizers, POS-taggers, morphological analyzers, etc. on it and run the process. The result is represented as a set of annotations: meta information attached to a piece of text (e.g. a token). In addition to a great number of plugins (including plugins for integration with other NLP resources like WordNet or Stanford Parser), it has many predefined dictionaries (cities, names, etc.) and its own regex-like language, JAPE. GATE comes with its own IDE (GATE Developer), where you can try your pipeline setup, and then save it and load it from Java code.
UIMA - or Unstructured Information Management Architecture. It is very similar to GATE in terms of architecture. It also represents a pipeline and produces a set of annotations. Like GATE, it has a visual IDE where you can try out your future application. The difference is that UIMA mostly concerns information extraction, while GATE performs text processing without explicit consideration of its purpose. UIMA also comes with a simple REST server.
OpenNLP - they call themselves organization center for open source projects on NLP, and this is the most appropriate definition. Main direction of development is to use machine learning algorithms for the most general NLP tasks like part-of-speech tagging, named entity recognition, coreference resolution and so on. It also has good integration with UIMA, so its tools are also available.
Stanford NLP - probably the best choice for engineers and researchers with NLP and ML knowledge. Unlike libraries like GATE and UIMA, it doesn't aim to provide as many tools as possible, but instead concentrates on idiomatic models. E.g. you don't have comprehensive dictionaries, but you can train a probabilistic algorithm to create them! In addition to its CoreNLP component, which provides the most widely used tools like tokenization, POS tagging, NER, etc., it has several very interesting subprojects. E.g. their Dependency framework allows you to extract complete sentence structure. That is, you can, for example, easily extract information about the subject and object of a verb in question, which is much harder using other NLP tools.
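Once a sentence has been parsed into dependency triples, pulling out the subject and object of a verb is a simple lookup. The sketch below uses plain Python over hand-written triples; the helper name `subject_and_object` and the edge list are hypothetical, and the object label is assumed to be `dobj` or `obj` depending on the dependency scheme in use.

```python
def subject_and_object(edges, verb):
    """Return (subjects, objects) of a verb from (governor, label, dependent) triples."""
    subjects = [dep for gov, label, dep in edges if gov == verb and label == "nsubj"]
    # Assumption: direct objects are labelled "dobj" or "obj", depending on the scheme.
    objects = [dep for gov, label, dep in edges if gov == verb and label in ("dobj", "obj")]
    return subjects, objects


# Hand-written edges for "Mary bought a book"; a real parser would produce these.
edges = [
    ("bought", "nsubj", "Mary"),
    ("bought", "dobj", "book"),
    ("book", "det", "a"),
]
subjects, objects = subject_and_object(edges, "bought")
print(subjects, objects)
```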
C++
UIMA - yes, there are complete implementations for both Java and C++.
Stanford Parser - some of Stanford's projects are only in Java, others are only in C++, and some are available in both languages. You can find many of them here.
APIs
A number of web service APIs perform specific language processing, including:
Alchemy API - language identification, named entity recognition, sentiment analysis and much more! Take a look at their main page - it is quite self-descriptive.
OpenCalais - this service tries to build giant graph of everything. You pass it a web page URL and it enriches this page text with found entities, together with relations between them. For example, you pass it a page with "Steve Jobs" and it returns "Apple Inc." (roughly speaking) together with probability that this is the same Steve Jobs.
Other recommendations
And yes, you should definitely take a look at Python's NLTK. It is not only a powerful and easy-to-use NLP library, but also a part of excellent scientific stack created by extremely friendly community.
Update (2017-11-15): 7 years later there are even more impressive tools, cool algorithms and interesting tasks. One comprehensive description may be found here:
https://tomassetti.me/guide-natural-language-processing/
Python and NLTK
ScalaNLP, which is a Natural Language Processing library written in Scala, seems suitable for your job.
I would recommend Python and NLTK.
Some hints and notes I can pinpoint based on my experience using it:
Python has efficient list and string handling. Indexing into lists is fast, which matters in natural language processing, where text is usually handled as lists of words. It also has nice syntactic sugar: for example, to access the first 100 words of a list, you can write list[:100] (compare this with the STL in C++).
Python serialization is easy and native. The serialization modules make saving and loading corpora and processed text an easy task, often one line of code (compare this with the several lines needed with Boost or other C++ libraries).
NLTK provides classes for loading corpora, processing them, tagging, tokenization, grammar parsing, chunking, and a whole set of machine learning algorithms, among other things. It also provides good resources for probabilistic models based on word distributions in text. http://www.nltk.org/book
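The first two points can be shown with the standard library alone. This is a minimal sketch: the sample text and variable names are made up, and `pickle` is used here as one example of Python's built-in serialization modules.

```python
import pickle

# Made-up sample text, split into a list of words.
text = "Natural language processing with Python is pleasant " * 30
words = text.split()

# Slicing: the first 100 words in one expression.
first_100 = words[:100]

# Serialization: one line to dump, one line to load back.
blob = pickle.dumps(first_100)
restored = pickle.loads(blob)

print(len(words), len(first_100), restored == first_100)
```

To write to disk instead of an in-memory bytes object, `pickle.dump(obj, file)` and `pickle.load(file)` work the same way with a file opened in binary mode.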
If learning a new programming language is an obstacle, you can check openNLP for Java http://incubator.apache.org/opennlp/

Resources