NLP: Arrange words with tags into a proper English sentence?

Let's say I have a sentence:
"you hello how are ?"
I get the output:
you_PRP hello_VBP how_WRB are_VBP
What is the best way to rearrange the words into a proper English sentence like: "Hello how are you ?"
I am new to natural language processing, so I am unfamiliar with many terms.
The only approach I can think of off the top of my head is using the tags to determine an adverb - verb - noun order and then rearranging the words based on that.
Note: let's assume I am trying to form a proper question, so ignore determining whether it's a question or a statement.

You should look into language models. A bigram language model, for example, will give you the probability of observing a sentence on the basis of the two-word sequences in that sentence. On the basis of a corpus of texts, it will have learned that "how are" has a higher probability of occurring than "are how". If you multiply the probabilities of all these two-word sequences in a sentence, you will get the probability of the sentence.
In other words, this is how you can solve your problem:
Find a corpus (either a simple text corpus, or a corpus that has been tagged with part-of-speech tags).
Learn a language model from that corpus. You can do this simply on the basis of the words, or on the basis of the words and their part-of-speech tags, as in your example.
Generate all possible sequences of your target words.
Use the language model to compute the probabilities of all those sequences.
Pick the sequence with the highest probability.
If you work with Python, NLTK has an API for training and using language models. Otherwise, KenLM is a popular language-modelling package.
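The steps above can be sketched in plain Python. The toy corpus and the 0.5 smoothing constant below are illustrative assumptions, not part of the answer:

```python
from collections import Counter
from itertools import permutations

# 1. A tiny corpus; in practice you would use a large text collection.
corpus = "hello how are you . hello how are things . how are you today".split()

# 2. Learn a bigram model: count word pairs and single words.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2, k=0.5):
    # Add-k smoothing so unseen bigrams get a small nonzero probability.
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * len(unigrams))

def sentence_prob(words):
    # Multiply the probabilities of all two-word sequences in the sentence.
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= bigram_prob(w1, w2)
    return p

# 3.-5. Generate all orderings of the target words and keep the most probable.
target = ["you", "hello", "how", "are"]
best = max(permutations(target), key=sentence_prob)  # ("hello", "how", "are", "you")
```

Since "hello how", "how are" and "are you" all occur in the corpus, that ordering gets the highest product of bigram probabilities.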

Related

How to solve difficult sentences for nlp sentiment analysis

Take the following sentence:
"Don't pay attention to people if they say it's no good."
As humans, we understand that the overall sentiment of the sentence is positive.
Technique 1: "Bag of Words" (BOW)
We have two categories of words: "positive" words with a polarity of 1 and "negative" words with a polarity of 0.
In this case, the word "good" falls into the positive category, but here it is only accidentally correct.
Thus, this technique is ruled out.
Technique 2: BOW with context (a step towards word embeddings)
Take the surrounding words into consideration, in this case the "no" preceding it: the phrase is "no good", not the adjective "good" alone. However, "no good" is still not what the author intended in the context of the entire sentence.
Hence this question. Thanks in advance.
Word embeddings are one possible way to capture the complexity coming from the sequence of terms in your example. Using models pre-trained on general English such as BERT should give you interesting results for your sentiment analysis problem. You can leverage the implementations provided by the Hugging Face library.
Another approach, which doesn't rely on compute-intensive techniques (such as word embeddings), is to use n-grams, which capture the sequence aspect and should provide good features for sentiment estimation. You can try different depths (unigrams, bigrams, trigrams, ...) and combine them with different types of preprocessing and/or tokenizers. Scikit-learn provides a good reference implementation of n-grams in its CountVectorizer class.
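A minimal sketch of the n-gram approach with CountVectorizer (the two toy sentences are just for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Don't pay attention to people if they say it's no good.",
    "This product is really good.",
]

# Extract unigrams and bigrams; with bigrams, "no good" becomes its own
# feature, distinct from the bare unigram "good".
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
```

The resulting feature matrix `X` can then be fed to any scikit-learn classifier for sentiment estimation.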

Dataset Language identification

I am working on a text classification problem with a multilingual dataset. I would like to know how the languages are distributed in my dataset and which languages they are. The number of languages is approximately 8-12. I am considering this language detection as part of the preprocessing. I would like to identify the languages in order to use the appropriate stop words and to see how having less data in some of the languages could affect the accuracy of the classification.
Is langid.py or the simpler langdetect suitable? Or are there any other suggestions?
Thanks
The easiest way to identify the language of a text is to have a list of common grammatical words of each language (pretty much your stop words, in fact), take a sample of the text and count which words occur in your (language-specific) word lists. Then sum them up and the word list with the largest overlap should be the language of the text.
If you want to be more advanced, you can use n-grams instead of words: collect n-grams from a text you know the language of, and use that as a classifier instead of your stop words.
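A minimal sketch of the word-list idea in plain Python (the tiny word lists here are purely illustrative; real lists would be much longer):

```python
# Tiny lists of common grammatical words per language (illustrative only).
STOP_WORDS = {
    "en": {"the", "is", "on", "and", "of", "a"},
    "fr": {"le", "la", "est", "sur", "et", "un"},
    "de": {"der", "die", "ist", "auf", "und", "ein"},
}

def guess_language(text):
    tokens = text.lower().split()
    # Count, for each language, how many tokens appear in its word list,
    # then return the language with the largest overlap.
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOP_WORDS.items()}
    return max(scores, key=scores.get)

guess_language("le chat est sur la table")  # -> "fr"
```

The same function works unchanged with character n-grams instead of words: replace the word sets with n-gram sets and split the text into n-grams rather than tokens.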
You could use any transformer-based model trained on multiple languages. For instance, you could use XLM-RoBERTa, a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require language tensors to indicate which language is used (which is good in your case) and should be able to determine the correct language from the input IDs. Besides, like any other transformer-based model, it comes with its own tokenizer, so you can skip the preprocessing part.
You could use the Hugging Face library to use any of these models.
Check the XLM-RoBERTa Hugging Face documentation here

Classify words with the same meaning

I have 50,000 subject lines from emails and I want to group the words in them based on synonyms, i.e. words that can be used in place of one another.
For example:
Top sales!
Best sales
I want these to be in the same group.
I built the following function with NLTK's WordNet, but it doesn't work well.
from nltk.corpus import wordnet
from nltk.corpus.reader.wordnet import WordNetError

def synonyms(w, group, guide):
    try:
        # Look up the first sense of each word for the given POS letter (guide)
        w1 = wordnet.synset(w + '.' + guide + '.01')
        w2 = wordnet.synset(group + '.' + guide + '.01')
        # wup_similarity may return None when the synsets share no hypernym path
        similarity = w1.wup_similarity(w2)
        return similarity is not None and similarity >= 0.7
    except WordNetError:
        # One of the words has no synset for this POS
        return False
Any ideas or tools to accomplish this?
The easiest way to accomplish this would be to compare the similarity of the respective word embeddings (the most common implementation of this is Word2Vec).
Word2Vec is a way of representing the semantic meaning of a token in a vector space, which enables the meanings of words to be compared without requiring a large dictionary/thesaurus like WordNet.
One problem with regular implementations of Word2Vec is that they do not differentiate between different senses of the same word. For example, the word bank would have the same Word2Vec representation in all of these sentences:
The river bank was dry.
The bank loaned money to me.
The plane may bank to the left.
Bank has the same vector in each of these cases, but you may want them to be sorted into different groups.
One way to solve this is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and potentially other features) of the token, allowing you to differentiate between the meanings of different senses of the word.
A great library for this in Python is spaCy. It is like NLTK, but much faster because it is written in Cython (its authors have reported roughly 20x faster tokenization and 400x faster tagging). Sense2Vec embeddings are available as a companion package from the same developers that plugs into spaCy, so you can accomplish your similarity task without much extra code.
It's as simple as:
import spacy

# the small default pipeline has no word vectors; load a model that does
nlp = spacy.load('en_core_web_md')
apples, and_, oranges = nlp(u'apples and oranges')
apples.similarity(oranges)
It's free and has a liberal license!
An idea is to solve this with embeddings and word2vec. The outcome is a mapping from words to vectors which are "near" when the words have similar meanings: for example, "car" and "vehicle" will be near, while "car" and "food" will not. You can then measure the vector distance between two words and define a threshold to decide whether they are close enough to mean the same thing. As I said, it's just an idea based on word2vec.
The computation behind what Nick said is to calculate the distance (cosine distance) between two phrase vectors:
Top sales!
Best sales
Here is one way to do so: How to calculate phrase similarity between phrases
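A minimal sketch of the cosine computation, assuming the phrase vector is simply the average of its word vectors (the three-dimensional toy vectors below are made up for illustration; a real system would use trained embeddings):

```python
import math

# Toy word vectors (illustrative only).
VECTORS = {
    "top":   [0.9, 0.1, 0.2],
    "best":  [0.8, 0.2, 0.1],
    "sales": [0.1, 0.9, 0.7],
}

def phrase_vector(words):
    # Average the word vectors component-wise to get one vector per phrase.
    dims = len(next(iter(VECTORS.values())))
    return [sum(VECTORS[w][i] for w in words) / len(words) for i in range(dims)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim = cosine_similarity(phrase_vector(["top", "sales"]),
                        phrase_vector(["best", "sales"]))
```

Because "top" and "best" have nearby toy vectors, the two phrase vectors come out highly similar; a threshold on this score decides whether two phrases land in the same group.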

Causal Sentences Extraction Using NLTK python

I am extracting causal sentences from accident reports on water. I am using NLTK as a tool here. I manually created my regexp grammar by examining 20 causal sentence structures [see examples below]. The constructed grammar is of the type
grammar = r'''Cause: {<DT|IN|JJ>?<NN.*|PRP|EX><VBD><NN.*|PRP|VBD>?<.*>+<VBD|VBN>?<.*>+}'''
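For context, a chunk grammar like this is applied with NLTK's RegexpParser over POS-tagged tokens, roughly as follows (the example sentence here is hand-tagged for illustration; in practice a POS tagger produces the tags):

```python
import nltk

grammar = r'''Cause: {<DT|IN|JJ>?<NN.*|PRP|EX><VBD><NN.*|PRP|VBD>?<.*>+<VBD|VBN>?<.*>+}'''
parser = nltk.RegexpParser(grammar)

# Hand-tagged causal sentence: "There was poor sanitation in the village"
tagged = [("There", "EX"), ("was", "VBD"), ("poor", "JJ"),
          ("sanitation", "NN"), ("in", "IN"), ("the", "DT"),
          ("village", "NN")]

tree = parser.parse(tagged)
# The sentence is flagged as causal if any subtree carries the "Cause" label
matched = any(st.label() == "Cause" for st in tree.subtrees())
```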
Now the grammar has 100% recall on the test set (I built my own toy dataset with 50 causal and 50 non-causal sentences) but low precision. I would like to ask about:
How to train NLTK to build the regexp grammar automatically for extracting a particular type of sentence.
Has anyone ever tried to extract causal sentences? Example causal sentences are:
There was poor sanitation in the village; as a consequence, she had health problems.
The water was impure in her village; for this reason, she suffered from parasites.
She had health problems because of poor sanitation in the village.
I would want to extract only the above type of sentences from a large text.
I had a brief discussion with Jacob Perkins, the author of the book "Python Text Processing with NLTK 2.0 Cookbook". He said, "a generalized grammar for sentences is pretty hard. I would instead see if you can find common tag patterns, and use those. But then you're essentially doing classification by regexp matching. Parsing is usually used to extract phrases within a sentence, or to produce deep parse trees of a sentence, but you're just trying to identify/extract sentences, which is why I think classification is a much better approach. Consider including tagged words as features when you try this, since the grammar could be significant." Taking his suggestions, I looked at the causal sentences I had and found that they contain words like:
consequently
as a result
Therefore
as a consequence
For this reason
For all these reasons
Thus
because
since
because of
on account of
due to
for the reason
so, that
These words indeed connect cause and effect in a sentence. Using these connectors it is now easy to extract causal sentences. A detailed report can be found on arXiv: https://arxiv.org/pdf/1507.02447.pdf
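A minimal sketch of the connector-based extraction in plain Python (the connector list is taken from above; the matching is deliberately naive and ignores word-sense issues such as temporal "since"):

```python
import re

# Causal connectors listed above, lowercased, multi-word phrases first
CONNECTORS = [
    "as a consequence", "for all these reasons", "for this reason",
    "as a result", "on account of", "because of", "for the reason",
    "due to", "consequently", "therefore", "thus", "because", "since",
]

# Match any connector as a whole phrase inside the sentence
pattern = re.compile(r"\b(" + "|".join(map(re.escape, CONNECTORS)) + r")\b")

def is_causal(sentence):
    return pattern.search(sentence.lower()) is not None

sentences = [
    "She had health problems because of poor sanitation in the village.",
    "The village is near the river.",
]
causal = [s for s in sentences if is_causal(s)]
```

Listing multi-word connectors before their sub-phrases ("because of" before "because") makes the alternation prefer the longest match.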

Identifying the entity in sentiment analysis using Lingpipe

I have implemented sentiment analysis using the sentiment analysis module of Lingpipe. I know that they use a Dynamic LR model for this. It just tells me if the test string is a positive sentiment or negative sentiment. What ideas could I use to determine the object for which the sentiment has been expressed?
If the text is categorized as positive sentiment, I would like to get the object for which the sentiment has been expressed - this could be a movie name, product name or others.
Although this question is really old, I would like to answer it for others' benefit.
What you want here is concept level sentiment analysis. For a very basic version, I would recommend following these steps:
Apply a sentence splitter. You can use either Lingpipe's Sentence Splitter or the OpenNLP Sentence Detector.
Apply part-of-speech tagging. Again, you can use either Lingpipe's POS tagger or the OpenNLP POS Tagger.
You then need to identify the token(s) tagged as nouns by the POS tagger. These tokens are the candidates for the targeted entity in the sentence.
Then you need to find the sentiment words in the sentence. The easiest way to do this is with a dictionary of sentiment-bearing words. You can find many such dictionaries online.
The next step is to find the dependency relations in the sentence. This can be achieved with the Stanford Dependency Parser. For example, if you try the sentence "This phone is good." in their online demo, you can see the following typed dependencies:
det(phone-2, This-1),
nsubj(good-4, phone-2),
cop(good-4, is-3),
root(ROOT-0, good-4)
The dependency nsubj(good-4, phone-2) here indicates that phone is the nominal subject of the token good, implying that the word good is expressed for phone. I am sure that your sentiment dictionary will contain the word good and phone would have been identified as a noun by the POS tagger. Thus, you can conclude that the sentiment good was expressed for the entity phone.
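The nsubj-based pairing described above can be sketched in plain Python, working directly on the typed dependencies listed (the tiny sentiment dictionary is an illustrative assumption):

```python
# Typed dependencies for "This phone is good." as (relation, head, dependent)
dependencies = [
    ("det", "phone-2", "This-1"),
    ("nsubj", "good-4", "phone-2"),
    ("cop", "good-4", "is-3"),
    ("root", "ROOT-0", "good-4"),
]

# A tiny dictionary of sentiment-bearing words (illustrative only)
SENTIMENT_WORDS = {"good", "great", "bad", "terrible"}

def word(token):
    # Strip the position index, e.g. "good-4" -> "good"
    return token.rsplit("-", 1)[0].lower()

def sentiment_targets(deps):
    # For each nsubj relation whose head is a sentiment word, the dependent
    # is the entity the sentiment is expressed about.
    return [(word(head), word(dep))
            for rel, head, dep in deps
            if rel == "nsubj" and word(head) in SENTIMENT_WORDS]

pairs = sentiment_targets(dependencies)  # [("good", "phone")]
```

Rules for other relations (e.g. amod, dobj) can be added in the same style to cover more complex sentiment-entity pairs.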
This was a very basic example. You can go a step further and create rules around the dependency relations to extract more complex sentiment-entity pairs. You can also assign scores to your sentiment terms and come up with a total score for the sentence depending upon the number of occurrences of sentiment words in that sentence.
Usually the main entity of a sentiment sentence is the object of that sentiment, so a basic heuristic is to run NER and take the first object. Otherwise you should use deep-parsing NLP toolkits and write some rules to link the sentiment to its object.
