Handling compound words (2-grams) using NLTK - nlp

I'm trying to identify user similarities by comparing the keywords used in their profiles (from a website). For example, Alice = pizza, music, movies, Bob = cooking, guitar, movie and Eve = knitting, running, gym. Ideally, Alice and Bob are the most similar. I put down some simple code to calculate the similarity. To account for possible plural/singular versions of the keywords I use something like:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
wnl = WordNetLemmatizer()
w1 = ["movies", "movie"]
tokens = [token.lower() for token in word_tokenize(" ".join(w1))]
lemmatized_words = [wnl.lemmatize(token) for token in tokens]
So that lemmatized_words = ["movie", "movie"].
Afterwards, I do some pairwise keywords comparison using spacy, such as:
import spacy
nlp = spacy.load('en')
t1 = nlp(u"pizza")
t2 = nlp(u"food")
sim = t1.similarity(t2)
Now, the problem starts when I have to deal with compound words such as: artificial intelligence, data science, whole food, etc. By tokenizing, I would simply split those words into two (e.g. artificial and intelligence), and this would affect my similarity measure. What would be the best approach to take these types of words into account?

There are many ways to achieve this. One way would be to create the embeddings (vectors) yourself. This would have two advantages: first, you would be able to use bi-, tri-, and beyond (n-) grams as your tokens, and secondly, you are able to define the space that is best suited for your needs --- Wikipedia data is general, but, say, children's stories would be a more niche dataset (and more appropriate / "accurate" if you were solving problems to do with children and/or stories). There are several methods, of course word2vec being the most popular, and several packages to help you (e.g. gensim).
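As a rough sketch of that do-it-yourself route (the toy corpus below is made up, and the parameter names follow gensim 4.x, where older versions use size instead of vector_size), gensim's Phrases can merge frequent bigrams such as "data science" into single tokens before training Word2Vec:
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

# Toy corpus of tokenized sentences; replace with your own data.
corpus = [
    ["data", "science", "is", "fun"],
    ["artificial", "intelligence", "and", "data", "science"],
    ["artificial", "intelligence", "needs", "data", "science"],
]

# Detect frequent bigrams and merge them into single tokens ("data_science").
bigram = Phraser(Phrases(corpus, min_count=1, threshold=1))
bigram_corpus = [bigram[sentence] for sentence in corpus]

# Train embeddings on the bigram-merged corpus.
model = Word2Vec(bigram_corpus, vector_size=50, min_count=1, epochs=50)
print(model.wv.similarity("data_science", "artificial_intelligence"))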
However, my guess is you would like something that's already out there. The best word embeddings right now are:
Numberbatch ('classic' best-in-class ensemble);
fastText, by Facebook Research (created at the character level --- some words that are out of vocabulary can be "understood" as a result);
sense2vec, by the same guys behind Spacy (created using parts-of-speech (POS) as additional information, with the objective to disambiguate).
The one we are interested in for a quick solution to your problem is sense2vec. You should read the paper, but essentially these word embeddings were created using Reddit with additional POS information, and are (thus) able to discriminate entities (e.g. nouns) that span multiple words. This blog post describes sense2vec very well. Here's some code to help you get started (taken from the prior links):
Install:
git clone https://github.com/explosion/sense2vec
cd sense2vec
pip install -r requirements.txt
pip install -e .
sputnik --name sense2vec --repository-url http://index.spacy.io install reddit_vectors
Example usage:
import sense2vec
model = sense2vec.load()
freq, query_vector = model["onion_rings|NOUN"]
freq2, query_vector2 = model["chicken_nuggets|NOUN"]
print(model.most_similar(query_vector, n=5)[0])
print(model.data.similarity(query_vector, query_vector2))
Important note: sense2vec requires spacy>=0.100,<0.101, meaning it will downgrade your current spaCy install; not too much of a problem if you are only loading the en model. Also, here are the POS tags used:
ADJ ADP ADV AUX CONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X
You could use spacy for POS and dependency tagging, and then sense2vec to determine the similarity of resulting entities. Or, depending on the frequency of your dataset (not too large), you could grab n-grams in descending (n) order, and sequentially check to see if each one is an entity in the sense2vec model.
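Here is a hedged sketch of that second idea, assuming the reddit_vectors model installed above is loaded; the lookup_longest_entity helper and its greedy strategy are mine, and the exact error raised for an unknown key depends on the sense2vec version, hence the broad except:
import sense2vec

model = sense2vec.load()

def lookup_longest_entity(words, max_n=3, tag="NOUN"):
    # Try n-grams in descending order of n and return the first key found
    # in the model, e.g. "artificial_intelligence|NOUN".
    for n in range(min(max_n, len(words)), 0, -1):
        for i in range(len(words) - n + 1):
            key = "_".join(words[i:i + n]) + "|" + tag
            try:
                freq, vector = model[key]
                return key, vector
            except Exception:
                continue
    return None, None

key, vector = lookup_longest_entity(["artificial", "intelligence"])
print(key)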
Hope this helps!

There is an approach using NLTK's MWETokenizer (multi-word expression tokenizer):
from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer([("artificial","intelligence"), ("data","science")], separator=' ')
tokens = tokenizer.tokenize("I am really interested in data science and artificial intelligence".split())
print(tokens)
The output is given as:
['I', 'am', 'really', 'interested', 'in', 'data science', 'and', 'artificial intelligence']
For more, you can read the NLTK documentation on MWETokenizer.

Related

What is nlp in spacy?

Usually we start from:
nlp = spacy.load('en_core_web_sm') # or medium, or large
or
nlp = English()
then:
doc = nlp('my text')
Then we can do a lot of fun things with that, even without knowing the nature of the first line.
But what exactly is 'nlp'? What is going on under the hood? Is "nlp" a pretrained model, as understood in machine learning, and therefore some big file located somewhere on disk?
I came across an explanation that 'nlp' is an 'object containing a processing pipeline', but that only explains a little.
You can always check the type of any python objects:
nlp = spacy.load('en_core_web_sm') # or medium, or large
print(type(nlp))
print(dir(nlp)) # view a list of attributes
You will get something like this (depending on the model you passed):
<class 'spacy.lang.en.English'>
You are right, it is something like a 'pretrained' model, as it contains the vocabulary, binary weights, etc.
Please check the official documentation:
https://spacy.io/api/language
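To make that concrete, you can list the pipeline components the loaded Language object runs your text through (the exact component names vary by model and spaCy version):
import spacy

nlp = spacy.load('en_core_web_sm')
print(type(nlp))       # <class 'spacy.lang.en.English'>
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'ner', ...]
print(len(nlp.vocab))  # number of lexemes in the shared vocabulary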
You could infer what nlp() is by exploring it. For example:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
text = "Elon Musk 889-888-8888 elonpie#tessa.net Jeff Bezos (345)123-1234 bezzi#zonbi.com Reshma Saujani example.email#email.com 888-888-8888 Barkevious Mingo"
text = nlp(text)
print(text)
This will print the exact same text. On the other hand, if you do:
for word in text.ents:
    print(word.text, word.label_)
you will get the entities of the string:
Elon Musk PERSON
889-888 CARDINAL
Jeff Bezos PERSON
345)123 CARDINAL
Reshma Saujani PERSON
It is indeed a large pre-trained model for the English language and has many functions (parser, lemmatizer, tagger, and the named-entity recognizer demonstrated above). Hope this helps a bit to clarify your question.
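A quick way to see the tagger and lemmatizer mentioned above at token level, reusing the same model:
import spacy

nlp = spacy.load("en_core_web_lg")
doc = nlp("Elon Musk founded several companies")
for token in doc:
    # part-of-speech tag and lemma for each token
    print(token.text, token.pos_, token.lemma_)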

finding semantic similarity between 2 statements

I am currently working on a small application in Python that has search functionality (currently using difflib), but I want to create a semantic search which can give the top 5 or 10 results from my database based on user-inputted text, much like a Google search works. I found some solutions here.
But the problem is that one of those solutions treats the two statements below as semantically different, and I don't care about that distinction, because it makes things harder than I want. I would also prefer the solution to be a pretrained neural network model or library that I can apply easily.
Pete and Rob have found a dog near the station.
Pete and Rob have never found a dog near the station
I also found some solutions that use gensim and GloVe embeddings, but they find the similarity between words, not sentences.
What I want
Suppose my db has the statement display classes, and user inputs such as show, showed, displayed, displayed class, show types etc. should be treated as the same. And if the two statements above are treated as the same, I also don't care. displayed and displayed class already show up with difflib.
Points to be noted
The search is over a fixed set of statements, but the user-inputted statements can differ.
It must work for statements, not just single words.
I think what you found is not a gensim embedding but a word2vec embedding; either way, it does not matter here.
You need tensorflow_hub.
The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
I believe what you need here is text classification or semantic similarity, because you want to find the nearest top 5 or 10 statements given a statement from the user.
It is easy to use, but the size of the model is ≈ 1 GB. It works with words, sentences, phrases or short paragraphs. The input is variable-length English text and the output is a 512-dimensional vector. You can find more information about it here.
Code
import tensorflow_hub as hub
import numpy as np
# Load model. It will download first time.
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5"
model = hub.load(module_url)
# data[0] is the statement we compare the others against
data = ["display classes", "show", "showed" ,"displayed class", "show types"]
# find high-dimensional vectors.
vecs = model(data)
# similarity between data[0] and every statement, via the inner product
dists = np.inner(vecs[0], vecs)
# print dists
print(dists)
Output
array([0.9999999 , 0.5633253 , 0.46475542, 0.85303843, 0.61701006],dtype=float32)
Conclusion
The first value, 0.9999999, is the similarity between display classes and display classes itself. The second, 0.5633253, is the similarity between display classes and show, and the last, 0.61701006, is the similarity between display classes and show types (higher means more similar).
Using this, you can compute the similarity between the given input and the statements in your db, then rank them accordingly, as sketched below.
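A minimal sketch of that ranking step, where db_statements and user_input are placeholders for your own data, reusing the model loaded above:
import numpy as np
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/universal-sentence-encoder-large/5")

db_statements = ["display classes", "delete classes", "show types"]
user_input = "showed class"

# Encode everything in one batch; the first vector is the user input.
vecs = model([user_input] + db_statements)
scores = np.inner(vecs[0], vecs[1:])

# Rank db statements by similarity, highest first, and keep the top 5.
top = sorted(zip(db_statements, scores), key=lambda p: p[1], reverse=True)[:5]
print(top)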
You can use WordNet to find synonyms and then use those synonyms to find similar statements.
import nltk
from nltk.corpus import wordnet as wn
nltk.download('wordnet')

def get_syn_list(gword):
    syn_list = []
    try:
        syn_list.extend(wn.synsets(gword, pos=wn.NOUN))
        syn_list.extend(wn.synsets(gword, pos=wn.VERB))
        syn_list.extend(wn.synsets(gword, pos=wn.ADJ))
        syn_list.extend(wn.synsets(gword, pos=wn.ADV))
    except:
        print("Something Wrong Happened")
    syn_words = []
    for i in syn_list:
        syn_words.append(i.lemmas()[0].name())
    return syn_words
Now split your statements in the db, like this:
stat = ["display classes"]
syn_dict = {}
for i in stat:
    tmp = []
    for x in i.split(" "):
        tmp.extend(get_syn_list(x))
    syn_dict[i] = set(tmp)
Now that you have the synonyms, just compare them with the inputted text, and use a lemmatizer before comparing words so that displayed becomes display, as in the sketch below.
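A small sketch of that final comparison, assuming the syn_dict built above; the lemmatizer call and the overlap scoring are my own additions:
from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()

def match_statements(user_text, syn_dict):
    # Lemmatize the user's words as verbs so that "displayed" becomes "display".
    words = {wnl.lemmatize(w, pos="v") for w in user_text.lower().split()}
    scores = {}
    for stat, syns in syn_dict.items():
        # Lemmatize the stored synonyms the same way before comparing.
        lemmas = {wnl.lemmatize(s.lower(), pos="v") for s in syns}
        scores[stat] = len(words & lemmas)
    # Highest overlap first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(match_statements("displayed class", syn_dict))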
Hey, you can use spaCy.
This answer is from https://medium.com/better-programming/the-beginners-guide-to-similarity-matching-using-spacy-782fc2922f7c
import spacy
nlp = spacy.load("en_core_web_lg")
doc1 = nlp("display classes")
doc2 = nlp("show types")
print(doc1.similarity(doc2))
Output
0.6277548513279427
Edit
Run the following command, which will download the model.
!python -m spacy download en_core_web_lg

CoreNLP: Can it tell whether a noun refers to a person?

Can CoreNLP determine whether a common noun (as opposed to a proper noun or proper name) refers to a person out-of-the-box? Or if I need to train a model for this task, how do I go about that?
First, I am not looking for coreference resolution, but rather a building block for it. Coreference by definition depends on the context, whereas I am trying to evaluate whether a word in isolation is a subset of "person" or "human". For example:
is_human('effort') # False
is_human('dog') # False
is_human('engineer') # True
My naive attempt to use Gensim's and spaCy's pre-trained word vectors failed to rank "engineer" above the other two words.
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")
for word in ('effort', 'dog', 'engineer'):
    print(word, word_vectors.similarity(word, 'person'))
# effort 0.42303842
# dog 0.46886832
# engineer 0.32456854
I found the following lists from CoreNLP promising.
dcoref.demonym // The path for a file that includes a list of demonyms
dcoref.animate // The list of animate/inanimate mentions (Ji and Lin, 2009)
dcoref.inanimate
dcoref.male // The list of male/neutral/female mentions (Bergsma and Lin, 2006)
dcoref.neutral // Neutral means a mention that is usually referred by 'it'
dcoref.female
dcoref.plural // The list of plural/singular mentions (Bergsma and Lin, 2006)
dcoref.singular
Would these work for my task? And if so, how would I access them from the Python wrapper? Thank you.
I would suggest trying WordNet instead and see:
if enough of your terms are covered by WordNet and
if the terms you want are hyponyms of person.n.01.
You'd have to expand this a bit to cover multiple senses, but the gist would be:
from nltk.corpus import wordnet as wn
# True
wn.synset('person.n.01') in wn.synset('engineer.n.01').lowest_common_hypernyms(wn.synset('person.n.01'))
# False
wn.synset('person.n.01') in wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('person.n.01'))
See the NLTK docs for lowest_common_hypernyms: http://www.nltk.org/howto/wordnet_lch.html
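A hedged sketch of that expansion, giving the is_human helper from the question a WordNet body; restricting the check to the most common noun sense is my choice (checking every sense would misfire on words like "dog", which also has informal person senses):
from nltk.corpus import wordnet as wn

PERSON = wn.synset('person.n.01')

def is_human(word):
    senses = wn.synsets(word, pos=wn.NOUN)
    if not senses:
        return False  # word not covered by WordNet
    # closure() walks the whole hypernym tree of the most common sense.
    return PERSON in senses[0].closure(lambda s: s.hypernyms())

print(is_human('effort'))    # False
print(is_human('dog'))       # False
print(is_human('engineer'))  # True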

NLP Extracting Related Phrases

From a given sentence I am able to extract all adjectives and nouns easily using CoreNLP.
But what I'm struggling to do is actually extract phrases out of the sentence.
For example, I have the following sentences:
This person is trust worthy.
This person is non judgemental.
This person is well spoken.
For all these sentences I want to extract the phrases trust worthy, non judgemental, well spoken and so forth. I want to extract all these related words.
How do I do this?
Thanks,
I think that you first need to think past these specific examples, and think about the structure of exactly what you want to extract. For example, in your cases you can use some simple heuristics to find any instance of a copular child, and all of its modifiers.
If the scope of what you need to extract is larger than that, you can go back to the drawing board and rethink some rules based on basic linguistic features that are available in e.g. Stanford CoreNLP, or, as another poster has linked, spaCy.
Finally, if you need the ability to generalize to other unknown examples, you may want to train a classifier (maybe start with a simple logistic regression classifier), by feeding it relevant linguistic features and tagging each token in a sentence as relevant or not relevant.
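As a hedged illustration of the copular-child heuristic mentioned above, here is a spaCy-based sketch (not CoreNLP); the dependency labels I rely on ('acomp'/'attr' for the predicate, 'advmod'/'amod'/'neg' for its modifiers) are assumptions, and real parses of these sentences may vary:
import spacy

nlp = spacy.load("en_core_web_sm")

def predicate_phrases(text):
    # Collect the complement of a copula ("is") together with its left modifiers.
    phrases = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.dep_ in ("acomp", "attr"):
                mods = [t for t in token.lefts if t.dep_ in ("advmod", "amod", "neg")]
                phrases.append(" ".join(t.text for t in mods + [token]))
    return phrases

print(predicate_phrases("This person is well spoken. This person is non judgemental."))
# roughly: ['well spoken', 'non judgemental'], depending on the parse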
For your specific use case Open Information Extraction seems to be a suitable solution. It extracts triples containing a subject, a relation and an object. Your relation seems to always be be (the infinitive of is) and your subject seems to always be person, so we are only interested in the object.
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;
import java.util.Collection;
import java.util.Properties;
public class OpenIE {

    public static void main(String[] args) {
        // Create the Stanford CoreNLP pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate your sentences
        Annotation doc = new Annotation("This person is trust worthy. This person is non judgemental. This person is well spoken.");
        pipeline.annotate(doc);

        // Loop over sentences in the document
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Get the OpenIE triples for the sentence
            Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
            // Print the triples
            for (RelationTriple triple : triples) {
                triple.object.forEach(object -> System.out.print(object.get(TextAnnotation.class) + " "));
                System.out.println();
            }
        }
    }
}
The output would be the following:
trust
worthy
non judgemental
judgemental
well spoken
spoken
The OpenIE algorithm possibly extracts multiple triples per sentence. For your use case the solution might be to just take the triple with the largest number of words in the object.
Another thing to mention is that the object of your first sentence is not extracted "correctly", at least not in the way you want it. This is happening because trust is a noun and worthy is an adjective.
The easiest solution would be to write it with a hyphen (trust-worthy).
Another possible solution is to check the Part of Speech Tags and perform some additional steps when you encounter a noun followed by an adjective.
To check similarity between similar phrases, you could use word embeddings such as GloVe. Some NLP libraries come with the embeddings, such as spaCy. https://spacy.io/usage/vectors-similarity
Note: spaCy uses cosine similarity on both a token level and a phrase level, and spaCy also offers a convenience similarity function for larger phrases/sentences.
For example:
(using spacy & python)
import spacy

nlp = spacy.load("en_core_web_lg")  # a model that ships word vectors
doc1 = nlp(u"The person is trustworthy.")
doc2 = nlp(u"The person is non judgemental.")
cosine_similarity = doc1.similarity(doc2)
And cosine_similarity could be used to show how similar two phrases/words/sentences are, ranging from 0 to 1, where 1 is very similar.

Classify words with the same meaning

I have 50,000 subject lines from emails and I want to classify the words in them based on synonyms or words that can be used in place of others.
For example:
Top sales!
Best sales
I want them to be in the same group.
I built the following function with NLTK's WordNet, but it doesn't work well.
from nltk.corpus import wordnet

def synonyms(w, group, guide):
    try:
        # Check if the words are similar
        w1 = wordnet.synset(w + '.' + guide + '.01')
        w2 = wordnet.synset(group + '.' + guide + '.01')
        if w1.wup_similarity(w2) >= 0.7:
            return True
        elif w1.wup_similarity(w2) < 0.7:
            return False
    except:
        return False
Any ideas or tools to accomplish this?
The easiest way to accomplish this would be to compare the similarity of the respective word embeddings (the most common implementation of this is Word2Vec).
Word2Vec is a way of representing the semantic meaning of a token in a vector space, which enables the meanings of words to be compared without requiring a large dictionary/thesaurus like WordNet.
One problem with regular implementations of Word2Vec is that they do not differentiate between different senses of the same word. For example, the word bank would have the same Word2Vec representation in all of these sentences:
The river bank was dry.
The bank loaned money to me.
The plane may bank to the left.
Bank has the same vector in each of these cases, but you may want them to be sorted into different groups.
One way to solve this is to use a Sense2Vec implementation. Sense2Vec models take into account the context and part of speech (and potentially other features) of the token, allowing you to differentiate between the meanings of different senses of the word.
A great library for this in Python is spaCy. It is like NLTK, but much faster as it is written in Cython (20x faster for tokenization and 400x faster for tagging). The same team also provides Sense2Vec embeddings as a companion package, so you can accomplish your similarity task without needing many other libraries.
It's as simple as:
import spacy
nlp = spacy.load('en')
apples, and_, oranges = nlp(u'apples and oranges')
apples.similarity(oranges)
It's free and has a liberal license!
An idea is to solve this with embeddings and word2vec. The outcome will be a mapping from words to vectors which are "near" when they have similar meanings; for example, "car" and "vehicle" will be near, while "car" and "food" will not. You can then measure the vector distance between two words and define a threshold to decide whether they are so near that they mean the same thing. As I said, it's just an idea based on word2vec; a rough sketch follows.
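A rough sketch of that idea with gensim's downloadable GloVe vectors (the approach is the same with word2vec vectors; the 0.7 threshold is arbitrary and worth tuning on your data):
import gensim.downloader as api

# Pretrained vectors; swap in a word2vec model for the same effect.
vectors = api.load("glove-wiki-gigaword-100")

def same_meaning(w1, w2, threshold=0.7):
    # Cosine similarity between the two word vectors.
    return vectors.similarity(w1, w2) >= threshold

print(same_meaning("top", "best"))
print(same_meaning("car", "food"))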
The computation behind what Nick said is to calculate the distance (cosine distance) between two phrase vectors.
Top sales!
Best sales
Here is one way to do so: How to calculate phrase similarity between phrases
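For instance, a minimal sketch of that cosine computation over the two subject lines, using spaCy document vectors (which, for models with word vectors, are the average of the token vectors):
import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

def phrase_cosine(a, b):
    # Doc.vector is the averaged token vector of the phrase.
    va, vb = nlp(a).vector, nlp(b).vector
    return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))

print(phrase_cosine("Top sales!", "Best sales"))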
