I am trying to extract collocations from a corpus using NLTK and then use their occurrences as features for a scikit-learn classifier.
Unfortunately I am not very familiar with NLTK, and I don't see an easy way to do this.
I got this far:
extract collocations from the corpus using BigramCollocationFinder
for each document, extract all bigrams (using nltk.bigrams) and check whether they are among the collocations
create a TfidfVectorizer with an analyzer that does nothing
feed it the documents in the form of the extracted bigrams
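For what it's worth, those steps can be sketched in a few lines. This is a toy corpus, and the association measure and nbest count are arbitrary choices, assuming NLTK and scikit-learn are installed:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from sklearn.feature_extraction.text import TfidfVectorizer

# toy corpus: each document is already a list of tokens
docs = [
    ["machine", "learning", "is", "fun"],
    ["machine", "learning", "with", "python"],
    ["deep", "learning", "is", "machine", "learning"],
]

# step 1: find the top collocations across the whole corpus
finder = BigramCollocationFinder.from_documents(docs)
top = set(finder.nbest(BigramAssocMeasures.likelihood_ratio, 2))

# step 2: per document, keep only bigrams that are corpus-level collocations
def collocation_analyzer(tokens):
    return [bg for bg in zip(tokens, tokens[1:]) if bg in top]

# steps 3 and 4: the callable analyzer does all the work, so the
# vectorizer performs no preprocessing of its own
vectorizer = TfidfVectorizer(analyzer=collocation_analyzer)
X = vectorizer.fit_transform(docs)
print(X.shape)  # (3, 2): three documents, two collocation features
```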
That seems pretty overcomplicated to me. It also has the problem that BigramCollocationFinder takes a window_size parameter for bigrams that span intervening words, which the standard nltk.bigrams extraction cannot replicate.
One way around this would be to instantiate a new BigramCollocationFinder for each document, extract its bigrams, and match them against the ones I found before... but again, that seems way too complicated.
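That said, the per-document finder is only a couple of lines, since the finder's ngram_fd frequency distribution exposes the windowed bigrams directly (made-up tokens here):

```python
from nltk.collocations import BigramCollocationFinder

tokens = ["new", "york", "is", "a", "big", "city"]

# window_size=3 also counts pairs with one intervening word
finder = BigramCollocationFinder.from_words(tokens, window_size=3)
doc_bigrams = set(finder.ngram_fd)

print(("new", "york") in doc_bigrams)  # adjacent pair
print(("new", "is") in doc_bigrams)    # skip-bigram within the window
```

These windowed bigrams can then be intersected with the corpus-level collocations found earlier.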
Surely there is an easier way to do this that I am overlooking.
Thanks for your suggestions!
larsmans has already contributed an NLTK / scikit-learn feature mapper for simple, non-collocation features. That might give you some inspiration for your own problem:
http://nltk.org/_modules/nltk/classify/scikitlearn.html
Related
I am using en_core_web_lg to compare some texts for similarity and I am not getting the expected results.
I suspect the issue is that my texts are mostly religious, for example:
"Thus hath it been decreed by Him Who is the Source of Divine inspiration."
"He, verily, is the Expounder, the Wise."
"Whoso layeth claim to a Revelation direct from God, ere the expiration of a full thousand years, such a man is assuredly a lying impostor. "
My question is: is there a way I can check spaCy's "dictionary"? Does it include words like "whoso", "layeth", "decreed", or "verily"?
To check whether spaCy knows about individual words, you can check tok.is_oov ("is out of vocabulary"), where tok is a token from a doc.
spaCy is trained on a dataset called OntoNotes. While that does include some older texts, like the Bible, it's mostly relatively recent newspapers and similar sources. The word vectors are trained on Internet text. I would not expect it to work well with documents of the type you are describing, which are very different from what it has seen before.
I would suggest you train custom word vectors on your dataset, which you can then load into spaCy. You could also look at the HistWords project.
For my Bachelor's thesis I need to train different word-embedding algorithms on the same corpus to benchmark them.
I am looking to find preprocessing steps but am not sure which ones to use and which ones might be less useful.
I already looked for some studies but also wanted to ask if someone has experience with this.
My objective is to train Word2Vec, fastText, and GloVe embeddings on the same corpus. I am not yet sure which corpus, but I am thinking of Wikipedia or something similar.
In my opinion:
POS tagging
removing non-alphabetic characters with regex or similar
stopword removal
lemmatization
phrase detection
are the logical options.
But I have heard that stopword removal can be tricky: an automatic stopword list might not fit a given model or corpus, so some embeddings may still end up containing stopwords.
Also, I have not decided whether to choose spaCy or NLTK as the library; spaCy is more powerful, but NLTK is what is mainly used at the chair where I am writing.
Preprocessing is like hyperparameter optimization or neural architecture search. There isn't a theoretical answer to "which one should I use". The applied section of this field (NLP) is far ahead of the theory. You just run different combinations until you find the one that works best (according to your choice of metric).
Yes, Wikipedia is great, and almost everyone uses it (plus other datasets). I've tried spaCy and it's powerful, but I think I made a mistake with it and ended up writing my own tokenizer, which worked better. YMMV. Again, you just have to jump in and try almost everything. Check with your advisor that you have enough time and computing resources.
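As a starting point, here is a dependency-free sketch of the non-tagging steps you listed; the stopword list and lemma table are toy stand-ins for what NLTK or spaCy would provide, which is part of why stopword removal is corpus-dependent:

```python
import re

# toy stand-ins for a real stopword list and lemmatizer
STOPWORDS = {"the", "a", "is", "of", "on"}
LEMMAS = {"embeddings": "embedding", "trained": "train", "corpora": "corpus"}

def preprocess(text):
    # remove non-alphabetic characters (the regex step)
    text = re.sub(r"[^a-zA-Z\s]", " ", text.lower())
    # stopword removal
    tokens = [t for t in text.split() if t not in STOPWORDS]
    # lookup-table "lemmatization"
    return [LEMMAS.get(t, t) for t in tokens]

print(preprocess("The embeddings are trained on a corpus!"))
# ['embedding', 'are', 'train', 'corpus']
```

Running the same corpus through several such pipelines (with and without each step) is exactly the kind of combination sweep described above.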
I'm kind of a newbie and not a native English speaker, so I have some trouble understanding Gensim's word2vec and doc2vec.
I think both give me the words most similar to a query word via most_similar() (after training).
How can I tell in which cases I should use word2vec versus doc2vec?
Could someone explain the difference in a few words, please?
Thanks.
In word2vec, you train word vectors and then run similarity queries between words. In doc2vec, you tag your texts, and you also get tag vectors. For instance, suppose you have documents from different authors and use the author as the tag on each document. Then, after doc2vec training, you can use the same vector arithmetic to run similarity queries on author tags, e.g., who are the most similar authors to AUTHOR_X? If two authors generally use the same words, their vectors will be close. AUTHOR_X is not a real word that is part of your corpus, just a label you choose, so you don't need to have it in your text or insert it manually. Gensim allows you to train doc2vec with or without word vectors (i.e., if you only care about similarities between tags).
Here is a good presentation on word2vec basics and how they use doc2vec in an innovative way for product recommendations (related blog post).
If you tell me about the problem you are trying to solve, maybe I can suggest which method would be more appropriate.
I have been trying to use the NER feature of NLTK. I want to extract such entities from articles. I know that it cannot be perfect at this, but I wonder: if there is human intervention in between to manually tag NEs, will it improve?
If yes, is it possible with the present model in NLTK to continually train the model (semi-supervised training)?
The plain-vanilla NER chunker provided in NLTK internally uses a maximum-entropy chunker trained on the ACE corpus. Hence it is not possible to identify dates or times unless you train it with your own classifier and data (which is quite a meticulous job).
You could refer to this link for how to do the same.
Also, there is a module called timex in nltk_contrib that might help you with your needs.
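timex is essentially a set of regular expressions over temporal expressions; a stripped-down sketch of the idea (these patterns are illustrative only, not timex's actual rules, which cover far more cases):

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# matches "January 5, 2014"-style dates and numeric dd/mm/yyyy dates
DATE_RE = re.compile(
    rf"\b(?:(?:{MONTHS})\s+\d{{1,2}},\s+\d{{4}}|\d{{1,2}}/\d{{1,2}}/\d{{2,4}})\b"
)

def find_dates(text):
    return DATE_RE.findall(text)

print(find_dates("The meeting on January 5, 2014 was moved to 02/11/2014."))
# ['January 5, 2014', '02/11/2014']
```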
If you are interested in doing the same in Java, have a look at Stanford SUTime; it is part of Stanford CoreNLP.
So, this question might be a little naive, but I thought asking the friendly people of Stack Overflow wouldn't hurt.
My current company has been using a third-party API for NLP for a while now. We basically URL-encode a string and send it over; they extract certain entities for us (we have a list of entities that we're looking for) and return a JSON mapping of entity : sentiment. We've recently decided to bring this project in-house instead.
I've been studying NLTK, Stanford NLP, and LingPipe for the past two days now, and I can't figure out whether I'm basically reinventing the wheel with this project.
We already have massive tables containing the original unstructured text and another table containing the extracted entities from that text and their sentiment. The entities are single words. For example:
Unstructured text : Now for the bed. It wasn't the best.
Entity : Bed
Sentiment : Negative
I believe that implies we have training data (unstructured text) as well as the entities and sentiments. But how can I go about using this training data with one of the NLP frameworks to get what we want? No clue. I've sort of got the steps, but I'm not sure:
Tokenize sentences
Tokenize words
Find the noun in the sentence (POS tagging)
Find the sentiment of that sentence.
But won't that fail for the case I mentioned above, since the bed is talked about across two different sentences?
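To make the cross-sentence problem concrete: one naive heuristic is to attribute each sentence's sentiment to the most recently mentioned entity. This is a toy sketch with made-up word lists, nowhere near a trained model, but it does handle the bed example:

```python
import re

# toy lexicons; a real system would use a trained model or a full lexicon
POSITIVE = {"best", "great", "comfortable"}
NEGATIVE = {"worst", "bad", "awful"}
NEGATIONS = {"not", "wasn't", "isn't", "never", "no"}

def entity_sentiments(text, entities):
    """Score each sentence and attribute it to the last entity seen,
    so a pronoun sentence like "It wasn't the best." counts for "bed"."""
    scores = {}
    current = None
    for sent in re.split(r"(?<=[.!?])\s+", text.lower()):
        words = re.findall(r"[a-z']+", sent)
        for ent in entities:
            if ent in words:
                current = ent
        if current is None:
            continue
        score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        if any(w in NEGATIONS for w in words):
            score = -score  # crude negation flip
        scores[current] = scores.get(current, 0) + score
    return scores

print(entity_sentiments("Now for the bed. It wasn't the best.", {"bed"}))
# {'bed': -1}
```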
So the question: does anyone know what the best framework would be for accomplishing the above tasks, and are there any tutorials on the same? (Note: I'm not asking for a solution.) If you've done this stuff before, is this task too large to take on? I've looked at some commercial APIs, but they're absurdly expensive to use (we're a tiny startup).
Thanks, Stack Overflow!
OpenNLP may also be a library to look at. At least they have a small tutorial on training the name finder and on using the document categorizer to do sentiment analysis. To train the name finder, you have to prepare training data by tagging the entities in your text with SGML-style tags.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
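For reference, the name-finder training data is one whitespace-tokenized sentence per line, with entity spans marked inline; the type name ("entity" below) is whatever label you choose:

```
The <START:entity> bed <END> was not the best .
Overall the <START:entity> mattress <END> felt great .
```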
NLTK provides a naive NER tagger along with resources, but it doesn't fit all cases (including finding dates). However, NLTK allows you to modify and customize the NER tagger according to your requirements. This link might give you some ideas, with basic examples of how to customize it. Also, if you are comfortable with Scala and functional programming, this is one tool you cannot afford to miss.
Cheers...!
I have discovered spaCy lately and it's just great! In the link you can find a comparison of its performance, in terms of speed and accuracy, against NLTK and CoreNLP, and it does really well!
That said, solving your task is not a matter of the framework. You can have two completely independent systems, one for NER and one for sentiment. The hype these days is neural networks, and if you are willing, you can train a recurrent neural network (which has shown the best performance on NLP tasks) with an attention mechanism to find both the entity and the sentiment.
There are great demos everywhere on the internet; the last two I have read and found interesting are [1] and [2].
Similar to spaCy, TextBlob is another fast and easy package that can accomplish many of these tasks.
I use NLTK, spaCy, and TextBlob frequently. If the corpus is simple, generic, and straightforward, spaCy and TextBlob work well out of the box. If the corpus is highly customized, domain-specific, or messy (incorrect spelling or grammar), I'll use NLTK and spend more time customizing my NLP text-processing pipeline with scrubbing, lemmatizing, etc.
NLTK Tutorial: http://www.nltk.org/book/
spaCy Quickstart: https://spacy.io/usage/
TextBlob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html