Best Open source / free NLP engine for the job [closed] - nlp

Let's say that I have a pool (a list) of well-known phrases, like:
{ "I love you", "Your mother is a ...", "I think I am pregnant" ... } Say there are about 1,000 of these. Now I want users to enter free text into a text box, and I want some kind of NLP engine to digest that text and find the 10 phrases from the pool that are most closely related to it.
I thought that the simplest implementation could be to go word by word: pick one word at a time and look for similarities in some way. I'm not sure which, though.
What frightens me most is the size of the vocabulary I would have to support. I am a single developer building a kind of demo, and I don't like the idea of filling a table with words...
I am looking for a free NLP engine. I am agnostic about the language it's written in, but it must be free - NOT some kind of online service that charges per API call.

It seems that TextBlob and ConceptNet are more than adequate solutions to this problem!

TextBlob is an easy-to-use NLP library for Python that is free and open source (licensed under the permissive MIT License). It provides a nice wrapper around the excellent NLTK and pattern libraries.
One simple approach to your problem would be to extract noun phrases from your given text.
Here's an example from the TextBlob docs.
from textblob import TextBlob  # in older TextBlob releases this was: from text.blob import TextBlob
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''
blob = TextBlob(text)
print(blob.noun_phrases)
# => ['titular threat', 'blob', 'ultimate movie monster', ...]
This could be a starting point. From there you could experiment with other methods, such as the similarity measures mentioned in the comments, or TF-IDF. TextBlob also makes it easy to swap in different models for noun phrase extraction.
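As one possible concrete starting point for the original question (ranking a pool of ~1,000 phrases against free text), here is a minimal TF-IDF sketch. It assumes scikit-learn, which is not mentioned above; any TF-IDF implementation would do.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The pool of well-known phrases (about 1,000 in the real application).
phrases = ["I love you", "Your mother is a ...", "I think I am pregnant"]

vectorizer = TfidfVectorizer()
phrase_vectors = vectorizer.fit_transform(phrases)  # one TF-IDF row per phrase

def top_matches(user_text, k=10):
    """Return the k pool phrases most similar to the user's free text."""
    query_vector = vectorizer.transform([user_text])
    scores = cosine_similarity(query_vector, phrase_vectors)[0]
    ranked = sorted(zip(scores, phrases), reverse=True)
    return [phrase for score, phrase in ranked[:k]]

print(top_matches("I really do love my mother"))
Noun phrases extracted with TextBlob could be fed in as the query instead of the raw text, which would cut down the noise from function words.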
Full disclosure: I am the author of TextBlob.

Related

Natural Language Processing books or resource for entry level person? [closed]

Can anyone give some suggestions for a good natural language processing book? These are the factors I have in mind:
It gives a good overview of these huge topics without too much depth.
Concepts are explained in picture form.
Sample code in Java/Python/R.
You can look at online courses about NLP. They often contain videos, exercises, written assignments, suggested readings...
I especially like this one: https://www.coursera.org/course/nlp (see the suggested readings section, for instance). You can access the lectures here: https://class.coursera.org/nlp/lecture (PDF + video + subtitles).
I believe there are three options for you--I wrote one of them so take this with a grain of salt.
1) Natural Language Processing with Python
by Steven Bird et al. http://amzn.com/0596516495. This book covers the NLTK API and is considered a solid introduction to NLP. Lots of code, a more academic take on what NLP is, and I assume it is broadly used in undergraduate NLP classes.
2) Natural Language Processing with Java by Richard Reese http://amzn.to/1D0liUY. This covers a range of APIs, including LingPipe (below), and introduces NLP concepts and how they are implemented in a range of open source APIs. It is a shallower dive into NLP, but it is a gentler introduction, and because it covers how a bunch of APIs solve the same problems it may help you pick which API to use.
3) Natural Language Processing with Java and LingPipe Cookbook by Breck Baldwin (me) and Krishna Dayanidhi http://amzn.to/1MvgHxa. This is meant for industrial programmers and covers the concepts common in commercial NLP applications. It is a much deeper dive into evaluation, problem specification, and varied technologies that on the face of it do the same thing, but it expects you to learn from examples (overwhelmingly Twitter data).
All the books have lots of code, one in Python and the other two in Java. All present mature APIs with a large installed base.
None of the books do much in the way of graphical explanation of what the software is doing.
Good luck

What do I need to know on NLP to be able to use and train Stanford NLP for intent analysis? [closed]

Any book, tutorial, or course recommendations would be much appreciated.
I need to know at what level I need to be with NLP in order to understand Stanford NLP and train and customize it for my commercial sentiment analysis app.
My goal is not a career in NLP or to become an NLP expert, but only to become proficient enough to understand and use open source NLP frameworks properly and train them for my application.
For this level, what NLP study/training would be needed?
I'm learning C# and .NET as well.
First: to simply use a sentiment model or train on existing data, there is not too much background to learn:
Tokenization
Constituency parsing, parse trees, etc.
Basic machine learning concepts (classification, cost functions, training / development sets, etc.)
These are well-documented ideas and are all a Google away. It might be worth it to skim the Coursera Natural Language Processing course (produced by people here at Stanford!) for the above ideas as well.
After that, the significant task is understanding how the RNTN sentiment model inside CoreNLP works. You don't need to grasp the math fully, I suppose, but the basic recursive nature of the algorithm is important to understand. The best resource is of course the original paper (and there's not much else, to be honest).
To train your own sentiment model, you'll need your own sentiment data. Producing this data is no small task. The data for the Stanford sentiment model was crowdsourced, and you may need to do something similar if you want to collect anything near the same scale.
The RNTN sentiment paper (linked above) gives some details on the data format. I'm happy to expand on this further if you do wish to create your own data.
I think you simply need to understand the concepts of supervised and unsupervised learning. In addition, some Java knowledge might be useful.

Diagrammatic method to model software components, their interactions & I/O [closed]

I'd like to model software components and the interaction between them: what information is passed, what processes take place in each component (not in too much detail), and a clear specification of each component's input/output.
What I've seen so far in UML is far too abstract and doesn't go into enough detail.
Any suggestions?
Some people design programs on paper as diagrams and then pass them to software developers to construct.
This approach has been tried: "clever guys" do the modeling and pass the models on to "ordinary" developers for the laborious work. It has not worked.
We like analogies, so we often compare software to the construction industry, where some people make the models (blueprints) and others do the building (construction). Our first instinct is that UML and other model diagrams are the equivalent of those blueprints. But it seems we are wrong.
To make an analogy with the construction industry, our blueprints are not our models and diagrams; our blueprints are actually the code we write.
Detailed Paper Models Like Cooking Recipes
It is not realistic to design a software system entirely on paper with detailed models up front. Software development is an iterative and incremental process.
Think of a map maker who makes a paper map of a city as big as the city itself, because he includes every detail without any level of abstraction. Would that map be useful?
Is Modeling Useless?
Definitely not. But you should apply it to the difficult parts of your problem-solution space, not to every trivial part.
So instead of handing developers every detail of the system on paper, explore the difficult parts of the problem-solution space with them face to face, using visual diagrams.
In the software industry, like it or hate it, source code is still the king. And all models are liars until they are implemented and tested.

Is there a search engine that will give a direct answer? [closed]

I've been wondering about this for a while and I can't see why Google hasn't tried it yet - or maybe they have and I just don't know about it.
Is there a search engine that you can type a question into which will give you a single answer rather than a list of results which you then have to trawl through yourself to find what you want to know?
For example, this is how I would design the system:
User’s input: “Where do you go to get your eyes tested?”
System output: “Opticians. Certainty: 95%”
This would be calculated as follows:
The input is parsed from natural language into a simple search string, probably something like “eye testing” in this case. The term “Where do you go” would also be interpreted by the system and used when comparing results.
The search string would be fed into a search engine.
The system would then compare the contents of the results to find matching words or phrases taking note of what the question is asking (i.e. what, where, who, how etc.)
Once a suitable answer is determined, the system displays it to the user along with a measure of how sure it is that the answer is correct.
Due to the dispersed nature of the Internet, a correct answer is likely to appear multiple times, especially for simple questions. For this particular example, it wouldn’t be too hard for the system to recognise that this word keeps cropping up in the results and that it is almost certainly the answer being searched for.
For more complicated questions, a lower certainty would be shown, and possibly multiple results with different levels of certainty. The user would also be offered the chance to see the sources which the system calculated the results from.
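Purely to illustrate the frequency-voting step above, here is a toy sketch; the snippets, stop-word list, and certainty measure are hypothetical stand-ins for what a real system would get back from a search API:
import re
from collections import Counter

question = "Where do you go to get your eyes tested?"
# Hypothetical result snippets returned by the underlying search engine.
snippets = [
    "You can get your eyes tested at any high street opticians.",
    "Most opticians offer a free eye test.",
    "Book an eye test with your local opticians today.",
]

# Ignore the question's own words plus a few common function words.
stop_words = set(question.lower().replace("?", "").split()) | {
    "a", "an", "the", "at", "any", "your", "with", "can", "of", "you",
}

words = []
for snippet in snippets:
    words += [w for w in re.findall(r"[a-z]+", snippet.lower()) if w not in stop_words]

candidate, votes = Counter(words).most_common(1)[0]
certainty = min(votes / len(snippets), 1.0)
print(f"{candidate.capitalize()}. Certainty: {certainty:.0%}")
# => Opticians. Certainty: 100%
A real implementation would of course need proper parsing, entity extraction, and a far more careful certainty estimate, but the voting idea is the same.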
The point of this system is that it simplifies searching. Many times when we use a search engine, we’re just looking for something really simple or trivial. Returning a long list of results doesn’t seem like the most efficient way of answering the question, even though the answer is almost certainly hidden away in those results.
Just take a look at the Google results for the above question to see my point:
http://www.google.co.uk/webhp?sourceid=chrome-instant&ie=UTF-8&ion=1&nord=1#sclient=psy&hl=en&safe=off&nord=1&site=webhp&source=hp&q=Where%20do%20you%20go%20to%20get%20your%20eyes%20tested%3F&aq=&aqi=&aql=&oq=&pbx=1&fp=72566eb257565894&fp=72566eb257565894&ion=1
The results given don't immediately answer the question - they need to be searched through by the user before the answer they really want is found. Search engines are great directories. They're really good for giving you more information about a subject, or telling you where to find a service, but they're not so good at answering direct questions.
There are many aspects that would have to be considered when creating the system – for example a website’s accuracy would have to be taken into account when calculating results.
Although the system should work well for simple questions, it may be quite a task to make it work for more complicated ones. For example, common misconceptions would need to be handled as a special case. If the system finds evidence that the user’s question has a common misconception as an answer, it should either point this out when providing the answer, or even simply disregard the most common answer in favour of the one provided by the website that points out that it is a common misconception. This would all have to be weighed up by comparing the accuracy and quality of conflicting sources.
It's an interesting question and would involve a lot of research, but surely it would be worth the time and effort? It wouldn't always be right, but it would make simple queries a lot quicker for the user.
Such a system is called an automatic Question Answering (QA) system, or a Natural Language search engine. It is not to be confused with a social Question Answering service, where answers are produced by humans. QA is a well studied area, as evidenced by almost a decade of TREC QA track publications, but it is one of the more difficult tasks in the field of natural language processing (NLP) because it requires a wide range of intelligence (parsing, search, information extraction, coreference, inference). This may explain why there are relatively few freely available online systems today, most of which are more like demos. Several include:
AnswerBus
START - MIT
QuALiM - Microsoft
TextMap - ISI
askEd!
Wolfram Alpha
Major search engines have shown interest in question answering technology. In an interview on June 1, 2011, Eric Schmidt said that Google's new strategy for search is to provide answers, not just links. "'We can literally compute the right answer,' said Schmidt, referencing advances in artificial intelligence technology" (source).
Matthew Goltzbach, head of products for Google Enterprise, has stated that "Question answering is the future of enterprise search." Yahoo has also forecast that the future of search involves users getting real-time answers instead of links. These big players are incrementally introducing QA technology as a supplement to other kinds of search results, as seen in Google's "short answers".
While IBM's Jeopardy-playing Watson has done much to popularize machines that answer questions (or, given Jeopardy's format, respond to answers), many real-world challenges remain in the general form of question answering.
See also the related question on open source QA frameworks.
Update:
2013/03/14: Google and Bing search execs discuss how search is evolving to conversational question answering (AllThingsD)
Wolfram Alpha
http://www.wolframalpha.com/
Wolfram Alpha (styled Wolfram|Alpha) is an answer engine developed by Wolfram Research. It is an online service that answers factual queries directly by computing the answer from structured data, rather than providing a list of documents or web pages that might contain the answer as a search engine would. It was announced in March 2009 by Stephen Wolfram, and was released to the public on May 15, 2009. It was voted the greatest computer innovation of 2009 by Popular Science.
http://en.wikipedia.org/wiki/Wolfram_Alpha
Have you tried wolframalpha?
Have a look at this: http://www.wolframalpha.com/input/?i=who+is+the+president+of+brasil%3F
Ask Jeeves, now Ask.com, used to do this. Here is why nobody does this anymore, except Wolfram:
Question Answering (QA) is far from a solved problem.
There exist strong question answering systems, but they require full parsing of both the question and the data and therefore require tremendous amounts of computing power and storage, even compared to Google scale, to get any coverage.
Most web data is too noisy to handle; you first have to detect if it's in a language you support (or translate it, as some researchers have done; search for "cross-lingual question answering"), then try to detect noise, then parse. You lose more coverage.
The internet changes at lightning pace. You lose even more coverage.
Users have gotten accustomed to keyword search, so that's much more economical.
Powerset, acquired by Microsoft, is also trying to do question answering. They call their product a "natural language search engine" where you can type in a question such as "Which US State has the highest income tax?" and search on the question instead of using keywords.

About "AUTOMATIC TEXT SUMMARIZER (lingustic based)" [closed]

I am having "AUTOMATIC TEXT SUMMARIZER (linguistic approach)" as my final year project. I have collected enough research papers and gone through them. Still I am not very clear about the 'how-to-go-for-it' thing. Basically I found "AUTOMATIC TEXT SUMMARIZER (statistical based)" and found that it is much easier compared to my project. My project guide told me not to opt this (statistical based) and to go for linguistic based.
Anyone who has ever worked upon or even heard of this sort of project would be knowing that summarizing any document means nothing but SCORING each sentence (by some approach involving some specific algos) and then selecting sentences having score more than threshold score. Now the most difficult part of this project is choosing the appropriate algorithm for scoring and later implementing it.
I have moderate programming skills and would like to code in JAVA (because there I'll get lots of APIs resulting in lesser overheads). Now I want to know that for my project, what should be my approach and algos used. Also how to implement them.
Using Lexical Chains for Text Summarization (Microsoft Research)
An analysis of different algorithms: DasMartins.2007
The most important part of the doc:
• Nenkova (2005) finds that no system could beat the baseline with statistical significance
• Striking result!
Note there are two different nuances to the linguistic approach:
Linguistic rating system (all clear here)
Linguistic generation (rewrites sentences to build the summary)
Automatic summarization is a pretty complex area. First get your Java skills in order, along with your understanding of statistical NLP and the machine learning it uses; then you can work towards building something of substance. Evaluate your solution, and make sure you have concretely defined your measurement variables and how you carried out your evaluation - otherwise, your project is doomed to failure. This is generally considered a high-risk project for final-year undergraduate students, because they often fail to get the principles right, then implement them in a way that is not right either, and end up with evaluation measures that are ill-defined and don't clearly reflect their own work.
My advice would be to focus on one area of summarization rather than many, since you can have single-document and multi-document summaries. The more varied you make your project, the less likely you are to receive a good mark. Keep it focused and in depth. Evaluate other people's work, then the process you decided to take and its outcomes.
Readings:
- The Jurafsky book on NLP has a section towards the back on summarization and QA.
- Advances in Text Summarization by Inderjeet Mani is really good.
Understand what things like term weighting, centroid-based summarization, log-likelihood ratio, coherence relations, sentence simplification, maximum marginal relevance, and redundancy are, and what a focused summary actually is.
You can attempt it using a supervised or an unsupervised approach, as well as a hybrid.
The linguistic approach is the safer option, which is why you have been advised to take it.
Try attempting it linguistically first, then build statistics on top to hybridize your solution.
Use it as an exercise to learn the theory and the practical implications of the algorithms, and to build on your knowledge, as you will no doubt have to explain and defend your project to the judging panel.
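To make the "score each sentence and keep those above a threshold" mechanics concrete, here is a minimal statistical sketch in Python. It is only an illustration of the baseline, not the linguistic approach you have been asked to take (which would derive the scores from lexical chains or similar), and the project itself would be in Java:
import re
from collections import Counter

def summarize(text, threshold=1.0):
    # Split into sentences and count word frequencies over the whole text.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    average = sum(score(s) for s in sentences) / len(sentences)
    # Keep sentences scoring above threshold * average, in original order.
    return [s for s in sentences if score(s) > threshold * average]

sample = ("Automatic summarization selects the most important sentences. "
          "Sentences are scored, and the highest scoring sentences are kept. "
          "Everything else is discarded.")
print(summarize(sample))
A linguistic version would keep exactly this select-above-threshold skeleton but replace the raw frequency scores with scores derived from lexical chains or discourse structure.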
If you really have read those research papers and books, you probably know what is already known. Now it is up to you to implement the knowledge from those papers and books in a Java application. Or you could expand human knowledge by doing some innovation or invention; if you do expand human knowledge, you have become a true scientist.
Please make your question more specific, in these two main areas:
Project definition: What is the goal of your project?
Is the input unit a single document? A list of documents?
Do you intend your program to use machine learning?
What is the output?
How will you measure success?
Your background knowledge: You intend to use linguistic rather than statistical methods.
Do you have background in parsing natural language? In semantic representation?
I think some of these questions are tough. I am asking them because I spent too much time trying to answer similar questions in the course of my studies. Once you get these sorted out, I may be able to give you some pointers. Mani's "Automatic Summarization" looks like a good start, at least the introductory chapters.
The University of Sheffield did some work on automatic email summarising as part of the EU FASiL project a few years back.
