I'm a newbie to AI. Can you suggest networks or libraries that use artificial intelligence to extract keywords when analysing text?
I am building an NLP pipeline and I am trying to get my head around the optimal structure. My understanding at the moment is the following:
Step 1 - Text pre-processing [a. lowercasing, b. stopword removal, c. stemming, d. lemmatisation]
Step 2 - Feature extraction
Step 3 - Classification - using different types of classifier (LinearSVC etc.)
From what I read online there are several approaches in regard to feature extraction but there isn't a solid example/answer.
a. Is there a solid strategy for feature extraction?
I read online that you can do [a. vectorising using scikit-learn, b. TF-IDF]
but I also read that you can use part-of-speech tags, word2vec or other embeddings, and named entity recognition.
b. What is the optimal process/structure of using these?
c. For the text pre-processing, I am processing a text column of a df, and the last modified version of it is what I use as input to my classifier. If you do feature extraction, do you do that in the same column, or do you create a new one and only send the features from that column to the classifier?
Thanks so much in advance
The preprocessing pipeline depends mainly on the problem you are trying to solve. TF-IDF, word embeddings, etc. each have their own restrictions and advantages.
You need to understand the problem and the data associated with it. To make the best use of the data, you need to implement the proper pipeline.
Specifically for text-related problems, you will find word embeddings very useful. TF-IDF is useful when the problem calls for emphasising words that occur in few documents, since it down-weights words that are common across the whole corpus. Word embeddings, on the other hand, convert the text to an N-dimensional vector that may turn out to be similar to other vectors. This can bring a sense of association to your data, and the model can learn the best features possible.
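To make the TF-IDF point concrete, here is a minimal sketch in plain Python. It assumes a made-up three-review corpus and the basic tf * log(N / df) weighting; libraries such as scikit-learn use smoothed variants, so their numbers will differ. The function name `tf_idf` is only for illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical reviews, already lowercased and tokenised).
docs = [
    "the room was clean".split(),
    "the room was noisy".split(),
    "the staff was friendly".split(),
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
# Print the terms of the first review, highest weight first.
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {w:.3f}")
```

In the first review, "clean" (found in one document) gets the highest weight, "room" (found in two) gets a lower one, and "the" and "was" (found in every document) get zero.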
In simple cases, we can use a bag-of-words representation to encode the texts.
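A bag-of-words representation can be sketched in a few lines of plain Python. This assumes whitespace tokenisation and lowercasing only; the function name `bag_of_words` and the two example texts are made up for illustration.

```python
from collections import Counter

def bag_of_words(texts):
    """Map each text to a count vector over a shared, sorted vocabulary."""
    tokenised = [text.lower().split() for text in texts]
    vocabulary = sorted({tok for doc in tokenised for tok in doc})
    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        # One count per vocabulary term; absent terms count as 0.
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["good food good service", "bad service"])
print(vocabulary)  # ['bad', 'food', 'good', 'service']
print(vectors)     # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Word order is discarded: each text becomes a fixed-length vector of counts that a classifier can consume directly.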
So, you need to discover the best approach for your problem. If you are solving a problem that closely resembles well-known NLP problems such as IMDB review classification or sentiment analysis on Twitter data, you can find a number of approaches on the internet.
I am planning to get some review data from TripAdvisor, and I want to be able to extract hotel-related aspects, assign polarity to them and classify them as negative or positive.
What tools can I use for this purpose, and how and where do I start? I know there are some tools like GATE, Stanford NLP, OpenNLP etc., but would I be able to perform the above specific tasks with them? If so, please let me know an approach to go forward. I am planning to use Java as the programming language and would like to use some APIs.
Also, should I go ahead with a rule-based approach, an ML approach that uses a trained corpus of reviews, or some other approach completely?
P.S.: I am new to NLP and I need some help to go forward.
Stanford CoreNLP has a lot of features in one package:
POS Tagger
NER Model
Sentiment Analysis
Parser
The Apache OpenNLP package consists of:
Sentence Detector
POS tagger
NER
Chunker
But OpenNLP doesn't have a built-in feature for finding sentiment polarity, so you have to pass your tags to another resource such as SentiWordNet to find the polarity.
I have used OpenNLP and Stanford CoreNLP. For both, you need to adapt the sentiment corpus to the restaurant domain.
You can try ConceptNet (http://conceptnet5.media.mit.edu/). See, for instance, the bottom of this page: https://github.com/commonsense/conceptnet5/wiki/API, which shows how to "see 20 things in English with the most positive affect".
So, this question might be a little naive, but I thought asking the friendly people of Stackoverflow wouldn't hurt.
My current company has been using a third party API for NLP for a while now. We basically URL encode a string and send it over, and they extract certain entities for us (we have a list of entities that we're looking for) and return a json mapping of entity : sentiment. We've recently decided to bring this project in house instead.
I've been studying NLTK, Stanford NLP and lingpipe for the past 2 days now, and can't figure out if I'm basically reinventing the wheel doing this project.
We already have massive tables containing the original unstructured text and another table containing the extracted entities from that text and their sentiment. The entities are single words. For example:
Unstructured text : Now for the bed. It wasn't the best.
Entity : Bed
Sentiment : Negative
I believe that implies we have training data (unstructured text) as well as entities and sentiments. Now, how can I go about using this training data with one of the NLP frameworks to get what we want? No clue. I've sort of got the steps, but I'm not sure:
Tokenize sentences
Tokenize words
Find the noun in the sentence (POS tagging)
Find the sentiment of that sentence.
But wouldn't that fail for the case I mentioned above, since it talks about the bed across 2 different sentences?
So the question - Does anyone know what the best framework would be for accomplishing the above tasks, and any tutorials on the same (Note: I'm not asking for a solution). If you've done this stuff before, is this task too large to take on? I've looked up some commercial APIs but they're absurdly expensive to use (we're a tiny startup).
Thanks stackoverflow!
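The four steps listed in the question, and the two-sentence bed case, can be sketched in plain Python. This is only an illustration under strong assumptions: the entity list, the sentiment word lists and the "carry the last seen entity forward" rule are all made up here, and a real system would use a POS tagger or NER model and a trained sentiment model instead.

```python
import re

# Hypothetical resources for illustration only.
ENTITIES = {"bed", "room", "staff"}
POSITIVE = {"best", "great", "clean"}
NEGATIVE = {"worst", "dirty", "noisy"}
NEGATORS = {"not", "wasn't", "isn't", "never"}

def split_sentences(text):
    """Step 1: split on whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Step 2: lowercase word tokens, keeping apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())

def sentence_polarity(tokens):
    """Step 4: sum lexicon hits, flipping a hit that follows a negator."""
    score, negate = 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        hit = (tok in POSITIVE) - (tok in NEGATIVE)
        if hit:
            score += -hit if negate else hit
            negate = False
    return score

def entity_sentiments(text):
    """Attach each sentence's polarity to the most recently seen entity."""
    scores, current = {}, None
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        # Step 3 (simplified): look entities up in a fixed list.
        mentioned = [t for t in tokens if t in ENTITIES]
        if mentioned:
            current = mentioned[-1]
        score = sentence_polarity(tokens)
        if current is not None and score != 0:
            scores[current] = scores.get(current, 0) + score
    return {e: "Positive" if s > 0 else "Negative"
            for e, s in scores.items() if s != 0}

print(entity_sentiments("Now for the bed. It wasn't the best."))
# {'bed': 'Negative'}
```

Carrying the last entity forward is a crude stand-in for coreference resolution; it handles "It wasn't the best" here but will misattribute sentiment as soon as the text switches topic without naming the new entity.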
OpenNLP may also be a library to look at. At least they have a small tutorial on training the name finder and on using the document categorizer to do sentiment analysis. To train the name finder, you have to prepare training data by tagging the entities in your text with SGML-style tags.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
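As a rough illustration of that tagging step: the OpenNLP manual describes training data with one whitespace-tokenised sentence per line and each entity wrapped in `<START:type> ... <END>` markers. The helper below is a hypothetical Python function (not part of OpenNLP) that produces such a line; the entity types `aspect` and `hotel` and the example sentence are made up.

```python
def mark_entities(tokens, spans):
    """Wrap token spans in OpenNLP-style <START:type> ... <END> markers.

    tokens: list of tokens for one sentence.
    spans:  list of (start, end, entity_type), end exclusive, non-overlapping.
    """
    starts = {start: etype for start, _end, etype in spans}
    ends = {end for _start, end, _etype in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in ends:
            out.append("<END>")
        if i in starts:
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
    # Close a span that runs to the end of the sentence.
    if len(tokens) in ends:
        out.append("<END>")
    return " ".join(out)

line = mark_entities(
    ["The", "bed", "at", "Hotel", "Plaza", "was", "hard", "."],
    [(1, 2, "aspect"), (3, 5, "hotel")],
)
print(line)
# The <START:aspect> bed <END> at <START:hotel> Hotel Plaza <END> was hard .
```

Since you already have tables of text and extracted entities, a script along these lines could turn them into training lines; check the linked manual for the exact format expected by the version you use.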
NLTK provides a naive NER tagger along with resources, but it doesn't fit all cases (finding dates, for example). However, NLTK allows you to modify and customize the NER tagger according to your requirements. This link might give you some ideas, with basic examples on how to customize it. Also, if you are comfortable with Scala and functional programming, this is one tool you cannot afford to miss.
Cheers...!
I have discovered spaCy lately and it's just great! In the link you can find a comparison of its speed and accuracy against NLTK and CoreNLP, and it does really well!
That said, solving your task is not a matter of choosing a framework. You can have two different systems, one for NER and one for sentiment, and they can be completely independent. The hype these days is to use neural networks, and if you are willing to, you can train a recurrent neural network (which has shown the best performance for NLP tasks) with an attention mechanism to find both the entity and the sentiment.
There are great demos all over the internet; the last two I have read and found interesting are [1] and [2].
Similar to spaCy, TextBlob is another fast and easy package that can accomplish many of these tasks.
I use NLTK, spaCy, and TextBlob frequently. If the corpus is simple, generic, and straightforward, spaCy and TextBlob work well out of the box. If the corpus is highly customized, domain-specific, messy (incorrect spelling or grammar), etc., I'll use NLTK and spend more time customizing my NLP text processing pipeline with scrubbing, lemmatizing, etc.
NLTK Tutorial: http://www.nltk.org/book/
Spacy Quickstart: https://spacy.io/usage/
Textblob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html
Is there a way to check for sentiment in text? I am trying to build a chat client that should analyse the text and be able to determine the mood of the user.
You can try using classification, as described in the data mining literature. You can, for example, train a support-vector machine (SVM) and use that for classification. Here are two papers I found that use this approach on blog text:
Understanding how bloggers feel: recognizing affect in blog posts
A Hybrid Mood Classification Approach for Blog Text
And here is a classification approach using features known from psychology which doesn't require training of a model:
Mood sensing from social media texts and its applications
Has anyone out there used RapidMiner for sentiment analysis? Is this the right combination?
If not, how do I get started with a simple sentiment analysis application?
RapidMiner is a very powerful text mining and sentiment analysis tool. I can recommend the RapidMiner training courses offered by Rapid-I. They gave me a really quick start. They also offer a dedicated course on text mining and sentiment analysis:
Sentiment Analysis, Opinion Mining, and Automated Market Research.
Starting in September or October 2009, they will also offer webinars. You should contact them directly if you would like to learn more about their webinars. Several major online market research companies in Europe and the US are using RapidMiner for opinion mining and sentiment analysis of internet discussion groups and web blogs. For more details and references, I would again suggest simply asking their team at contact(at)rapid-i.com or checking their RapidMiner forum at forum.rapid-i.com.
Best regards,
Frank
This series of videos should help:
http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-loading.html
When I go to the RapidMiner site, it confuses me.
http://rapidminer.com/solutions/sentiment-analysis/
"It looks like a crowd sourcing to identify the polarity of product reviews and discussions around the web." If you are looking to automate in real time this might not be a good solution.
spotdy.com offers free NLP for developers. It works pretty well.
Most sentiment analysis software tokenizes the text, gives each word a positive or negative score, and sums those up. Since language is contextual, this ignores the context, which is not the right way to do it.
Instead, deep learning models or HMMs based on sentence structure compute the sentiment from how the words are composed in a sentence. Check out spotdy.com. It is free.
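The limitation of summing per-word scores can be seen in a few lines of plain Python. The word scores below are made up for illustration; real lexicons are far larger.

```python
# Hypothetical word scores.
LEXICON = {"good": 1, "great": 2, "bad": -1, "terrible": -2}

def naive_sentiment(text):
    """Sum per-word scores, ignoring word order and context."""
    return sum(LEXICON.get(tok, 0) for tok in text.lower().split())

print(naive_sentiment("the food was good"))      # 1
print(naive_sentiment("the food was not good"))  # 1, the negation is ignored
```

Both sentences get the same positive score, because "not" carries no score of its own and the sum has no notion of what it modifies.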