Text processing tool for tweets - nlp

I am collecting millions of sports related tweets daily. I want to process the text in those tweets. I want to recognize the entities, find the sentiment of the sentence and find the events in those tweets.
Entity recognizing :
For example :
"Rooney will play for England in their next match".
From this tweet i want to recognize person entity "Rooney" and place entity "England"
sentiment analysis:
I want to find the sentiment of a sentence. For example
Chelsea played their worst game ever
Ronaldo scored a beautiful goal
The first one should marked as "negative" sentence and the later one should marked as "positive".
Event recognizing :
I want to find "goal scoring event" from tweets. Sentences like "messi scored goal in first half" and "that was a fantastic goal from gerrald" should marked as "goal scoring event".
I know entity recognizing and sentiment analysis tools are available and i need to write the rules for event recognizing. I have seen so many tools like Stanford NER, alchemy api, open calais, meaning cloud api, ling pipe, illinois etc..
I'm really confused about which tool I should select? Is there any free tools available without daily rate limits? I want to process millions of tweets daily and java is my preferable language.
Thanks.

For NER you can also use TwitIE which is a GATE pipeline so you can use it using the GATE API in Java.

Given the consideration that your preferred language is Java, I would strongly suggest to start with Stanford NLP project. Most of your basic needs like cleansing, chunking, NER can be done based on that. For NER click here.
Going ahead for sentiment analysis you can use simplistic classifiers like Naive Bayes and then add complexities. More here.
For the event extraction, you can use linguistic approach to identify the verbs with their association with ontology on your side.
Just remember, this is just to get you started and no way an extensive answer.

No API with unlimited call availalble. IF you want to stick with java, use stanford package with customization as per your need.
If you are comfortable with python, look at nltk.
Well, for person, organization stanford will work, for your input query :
Rooney will play for England in their next match
[Text=Rooney CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=NNP Lemma=Rooney NamedEntityTag=PERSON] [Text=will CharacterOffsetBegin=7 CharacterOffsetEnd=11 PartOfSpeech=MD Lemma=will NamedEntityTag=O] [Text=play CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=VB Lemma=play NamedEntityTag=O] [Text=for CharacterOffsetBegin=17 CharacterOffsetEnd=20 PartOfSpeech=IN Lemma=for NamedEntityTag=O] [Text=England CharacterOffsetBegin=21 CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=England NamedEntityTag=LOCATION] [Text=in CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=IN Lemma=in NamedEntityTag=O] [Text=their CharacterOffsetBegin=32 CharacterOffsetEnd=37 PartOfSpeech=PRP$ Lemma=they NamedEntityTag=O] [Text=next CharacterOffsetBegin=38 CharacterOffsetEnd=42 PartOfSpeech=JJ Lemma=next NamedEntityTag=O] [Text=match CharacterOffsetBegin=43 CharacterOffsetEnd=48 PartOfSpeech=NN Lemma=match NamedEntityTag=O]
If you want to add eventrecognization too, you need to retrain the stanford package with extrac class having event based dataset. Which can help you to classify event based input.
Does the NER use part-of-speech tags?
None of our current models use pos tags by default. This is largely
because the features used by the Stanford POS tagger are very similar
to those used in the NER system, so there is very little benefit to
using POS tags.
However, it certainly is possible to train new models which do use POS
tags. The training data would need to have an extra column with the
tag information, and you would then add tag=X to the map parameter.
check - http://nlp.stanford.edu/software/crf-faq.shtml

Stanford NER and OPENNLP are both open-source and have models that perform well on formal article/texts.
But their accuracy drops significantly over Twitter (from 90% recall over formal text to 40% recall over tweets)
The informal nature of tweets (bad capitalization, spellings, punctuations), improper usage of words, vernacularity and emoticons makes it more complicated
NER, sentiment analysis and event extraction over tweets is a well-researched area apparently for its applications.
Take a look at this: https://github.com/aritter/twitter_nlp, see this demo of twitter NLP and event extraction: http://ec2-54-170-89-29.eu-west-1.compute.amazonaws.com:8000/
Thank you

Related

dataset to use for question formation from any text

I am trying to create an improved quiz generator that accepts a certain text as an input and forms questions from the sentences. I want to create a machine learning model that splits the sentence into different parts so it is capable of forming different questions from the same sentence. For example: from the sentence "Amazon river is the longest river in South America." should form questions: What is the longest river in South America? Is Amazon river the longest river in South America? Where is Amazon river located? etc. If possible, I would also like it to get the context from multiple sentences and then form one question from multiple sentence information. I want it to be able to perform well on any text, not just specific topic. How should I make my dataset or which dataset should I use?
I don't have a lot of previous knowledge on the topic, so I was thinking of somehow using nltk.pos_tag() which specifies everyword in a sentence. I am just not sure how to use it in my model and dataset.
What you're attempting to do is non-trivial and is related to the task of Automatic Question Generation (AQG) which looks at converting structured or unstructured declarative natural language sentences into valid interrogative forms. Various automated linguistic (rules-based) and statistical methods have been employed. I'd recommend reading [1] by Blšták & Rozinajová, particularly Section 2 which summarises some of the datasets and methods available. The survey by Lu & Lu [2] provides a recent overview of the field. It seems like the most common approach is to leverage existing QA datasets (e.g. SQuAD, HotpotQA et cetera, see Table 5 of [2]). In terms of more practical, quick ways to get started without having to train your own ML/DL model, you could use existing Transformer-based models from HuggingFace such as iarfmoose/t5-base-question-generator available here which takes concatenated answers and context as an input sequence, e.g.:
<answer> answer text here <context> context text here
and will generate a full question (interrogative) sentence as an output sequence. According to the author, it is recommended that a large number of sequences be generated and then filtered with iarfmoose/bert-base-cased-qa-evaluator.
References
[1] Blšták, M. and Rozinajová, V., 2022. Automatic question generation based on sentence structure analysis using machine learning approach. Natural Language Engineering, 28(4), pp.487-517.
[2] Lu, C.Y. and Lu, S.E., 2021, October. A Survey of Approaches to Automatic Question Generation: from 2019 to Early 2021. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021) (pp. 151-162).

Named Entity Recognition Systems for German Texts

I am working on a Named Entity Recognition (NER) project in which I got a large amount of text in the sense that it is too much to read or skim read. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and create an index of kind (entity, list of pages/lines where it is mentioned). I have worked through Standford's NLP lecture, (parts of) Eisenstein's Introduction to NLP book found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I would not know if I could solve this challenge even if texts were in English.
As a first step
are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I thought about fine-tuning or transfer learning the base NER model to a corpus of historic texts to improve NER.
A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
Spacy
probably NLTK and OpenNLP as well
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
Btw you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't collect exactly every that you would like it to collect, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model and this requires obtaining some annotated data (i.e. manually annotating or paying somebody to do it).
Even in this case, a NER model is statistical and it will unavoidably make some mistakes, don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not too much) about the actual entity, because it's not primarily based on the words in the entity. It's based on indications in the surrounding text, for example in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by" and 'July' as a date because of "take place in". However if the language used in the corpus is very different from the training data used for the model, the performance could be very bad.

Entity Recognition and Sentiment Analysis using NLP

So, this question might be a little naive, but I thought asking the friendly people of Stackoverflow wouldn't hurt.
My current company has been using a third party API for NLP for a while now. We basically URL encode a string and send it over, and they extract certain entities for us (we have a list of entities that we're looking for) and return a json mapping of entity : sentiment. We've recently decided to bring this project in house instead.
I've been studying NLTK, Stanford NLP and lingpipe for the past 2 days now, and can't figure out if I'm basically reinventing the wheel doing this project.
We already have massive tables containing the original unstructured text and another table containing the extracted entities from that text and their sentiment. The entities are single words. For example:
Unstructured text : Now for the bed. It wasn't the best.
Entity : Bed
Sentiment : Negative
I believe that implies we have training data (unstructured text) as well as entity and sentiments. Now how I can go about using this training data on one of the NLP frameworks and getting what we want? No clue. I've sort of got the steps, but not sure:
Tokenize sentences
Tokenize words
Find the noun in the sentence (POS tagging)
Find the sentiment of that sentence.
But that should fail for the case I mentioned above since it talks about the bed in 2 different sentences?
So the question - Does any one know what the best framework would be for accomplishing the above tasks, and any tutorials on the same (Note: I'm not asking for a solution). If you've done this stuff before, is this task too large to take on? I've looked up some commercial APIs but they're absurdly expensive to use (we're a tiny startup).
Thanks stackoverflow!
OpenNLP may also library to look at. At least they have a small tutuorial to train the name finder and to use the document categorizer to do sentiment analysis. To trtain the name finder you have to prepare training data by taging the entities in your text with SGML tags.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
NLTK provides a naive NER tagger along with resources. But It doesnt fit into all cases (including finding dates.) But NLTK allows you to modify and customize the NER Tagger according to the requirement. This link might give you some ideas with basic examples on how to customize. Also if you are comfortable with scala and functional programming this is one tool you cannot afford to miss.
Cheers...!
I have discovered spaCy lately and it's just great ! In the link you can find comparative for performance in term of speed and accuracy compared to NLTK, CoreNLP and it does really well !
Though to solve your problem task is not a matter of a framework. You can have two different system, one for NER and one for Sentiment and they can be completely independent. The hype these days is to use neural network and if you are willing too, you can train a recurrent neural network (which has showed best performance for NLP tasks) with attention mechanism to find the entity and the sentiment too.
There are great demo everywhere on the internet, the last two I have read and found interesting are [1] and [2].
Similar to Spacy, TextBlob is another fast and easy package that can accomplish many of these tasks.
I use NLTK, Spacy, and Textblob frequently. If the corpus is simple, generic, and straightforward, Spacy and Textblob work well OOTB. If the corpus is highly customized, domain-specific, messy (incorrect spelling or grammar), etc. I'll use NLTK and spend more time customizing my NLP text processing pipeline with scrubbing, lemmatizing, etc.
NLTK Tutorial: http://www.nltk.org/book/
Spacy Quickstart: https://spacy.io/usage/
Textblob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html

Short text classification

I am about to start a project where my final goal is to classify short texts into classes: "may be interested in visiting place X" : "not interested or neutral". Place is described by set of keywords (e.g. meals or types of miles like "chinese food"). So ideally I need some approach to model desire of user based on short text analysis - and then classify based on a desire score or desire probability - is there any state-of-the-art in this field ? Thank you
This problem is exactly the same as sentiment analysis of texts. But, instead of the traditional binary classification, you seem to have a "neutral" opinion. State-of-the-art in sentiment analysis is highly domain-dependent. Techniques that have excelled in classifying movies do not perform as well on commercial products, for example.
Additionally, even the feature-selection is highly domain-dependent. For example, unigrams work well for movie review classification, but a combination of unigrams and bigrams perform better for classifying twitter texts.
My best advice is to "play around" with different features. Since you are looking at short texts, twitter is probably a good motivational example. I would start with unigrams and bigrams as my features. The exact algorithm is not very important. SVM usually performs very well with correct parameter tuning. Use a small amount of held-out data for tuning these parameters before experimenting on bigger datasets.
The more interesting portion of this problem is the ranking! A "purity score" has been recently used for this purpose in the following papers (and I'd say they are pretty state-of-the-art):
Sentiment summarization: evaluating and learning user preferences. Lerman, Blair-Goldensohn and McDonald. EACL. 2009.
The viability of web-derived polarity lexicons. Velikovich, Blair-Goldensohn, Hannan and McDonald. NAACL. 2010.

Sentiment analysis with NLTK python for sentences using sample data or webservice?

I am embarking upon a NLP project for sentiment analysis.
I have successfully installed NLTK for python (seems like a great piece of software for this). However,I am having trouble understanding how it can be used to accomplish my task.
Here is my task:
I start with one long piece of data (lets say several hundred tweets on the subject of the UK election from their webservice)
I would like to break this up into sentences (or info no longer than 100 or so chars) (I guess i can just do this in python??)
Then to search through all the sentences for specific instances within that sentence e.g. "David Cameron"
Then I would like to check for positive/negative sentiment in each sentence and count them accordingly
NB: I am not really worried too much about accuracy because my data sets are large and also not worried too much about sarcasm.
Here are the troubles I am having:
All the data sets I can find e.g. the corpus movie review data that comes with NLTK arent in webservice format. It looks like this has had some processing done already. As far as I can see the processing (by stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? Here all the data sets have already been organised into positive/negative already e.g. polarity dataset http://www.cs.cornell.edu/People/pabo/movie-review-data/ How is this done? (to organise the sentences by sentiment, is it definitely WEKA? or something else?)
I am not sure I understand why WEKA and NLTK would be used together. Seems like they do much the same thing. If im processing the data with WEKA first to find sentiment why would I need NLTK? Is it possible to explain why this might be necessary?
I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences rather than using the data samples given in the link?
Any help is much appreciated and will save me much hair!
Cheers Ke
The movie review data has already been marked by humans as being positive or negative (the person who made the review gave the movie a rating which is used to determine polarity). These gold standard labels allow you to train a classifier, which you could then use for other movie reviews. You could train a classifier in NLTK with that data, but applying the results to election tweets might be less accurate than randomly guessing positive or negative. Alternatively, you can go through and label a few thousand tweets yourself as positive or negative and use this as your training set.
For a description of using Naive Bayes for sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Then in that code, instead of using the movie corpus, use your own data to calculate word counts (in the word_feats method).
Why dont you use WSD. Use Disambiguation tool to find senses. and use map polarity to the senses instead of word. In this case you will get a bit more accurate results as compared to word index polarity.

Resources