Using Core NLP, I can easily extract all adjectives and nouns from a given sentence.
But what I'm struggling with is actually extracting phrases out of the sentence.
For example, I have the following sentences:
This person is trust worthy.
This person is non judgemental.
This person is well spoken.
For all these sentences, using NLP I want to extract the phrases trust worthy, non judgemental, well spoken and so forth. I want to extract all these related words.
How do I do this?
Thanks
I think you first need to think past these specific examples and consider the structure of exactly what you want to extract. For example, in your case you can use a simple heuristic: find any instance of a copular child and collect all of its modifiers (see the sketch below).
If the scope of what you need to extract is larger than that, you can go back to the drawing board and devise rules based on basic linguistic features that are available in e.g. Stanford CoreNLP or, as another poster has linked, spaCy.
Finally, if you need the ability to generalize to other, unseen examples, you may want to train a classifier (maybe start with a simple logistic regression classifier) by feeding it relevant linguistic features and tagging each token in a sentence as relevant or not relevant.
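Here is a minimal sketch of that copular heuristic using spaCy's dependency labels, where the predicate of a copular "be" attaches to the verb as acomp (adjectival) or attr (nominal). Parses vary by model, especially for an unusual pair like "trust worthy" (a noun followed by an adjective), so treat this as a starting point rather than a definitive rule:
import spacy

nlp = spacy.load("en_core_web_sm")

def copular_complements(sentence):
    """Collect the predicate of a copular 'be' together with its modifiers."""
    doc = nlp(sentence)
    phrases = []
    for token in doc:
        # in spaCy's label scheme, "X is Y" attaches Y to "is" as acomp/attr
        if token.dep_ in ("acomp", "attr") and token.head.lemma_ == "be":
            # take the contiguous subtree of the predicate (includes its modifiers)
            span = doc[token.left_edge.i : token.right_edge.i + 1]
            phrases.append(span.text)
    return phrases

for s in ["This person is trust worthy.",
          "This person is non judgemental.",
          "This person is well spoken."]:
    print(copular_complements(s))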
For your specific use case, Open Information Extraction seems to be a suitable solution. It extracts triples containing a subject, a relation and an object. Your relation always seems to be be (the infinitive of is) and your subject always seems to be person, so we are only interested in the object.
import edu.stanford.nlp.ie.util.RelationTriple;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.naturalli.NaturalLogicAnnotations;
import edu.stanford.nlp.util.CoreMap;

import java.util.Collection;
import java.util.Properties;

public class OpenIE {
    public static void main(String[] args) {
        // Create the Stanford CoreNLP pipeline
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse,natlog,openie");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // Annotate your sentences
        Annotation doc = new Annotation("This person is trust worthy. This person is non judgemental. This person is well spoken.");
        pipeline.annotate(doc);

        // Loop over the sentences in the document
        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            // Get the OpenIE triples for the sentence
            Collection<RelationTriple> triples = sentence.get(NaturalLogicAnnotations.RelationTriplesAnnotation.class);
            // Print the object of each triple
            for (RelationTriple triple : triples) {
                triple.object.forEach(object -> System.out.print(object.get(TextAnnotation.class) + " "));
                System.out.println();
            }
        }
    }
}
The output would be the following:
trust
worthy
non judgemental
judgemental
well spoken
spoken
The OpenIE algorithm may extract multiple triples per sentence. For your use case, the solution might be to just take the triple with the largest number of words in the object.
Another thing to mention is that the object of your first sentence is not extracted "correctly", at least not in the way you want it. This happens because trust is a noun and worthy is an adjective.
The easiest solution would be to write it with a hyphen (trust-worthy).
Another possible solution is to check the part-of-speech tags and perform some additional steps when you encounter a noun followed by an adjective.
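As a rough sketch of that last idea, here is how such a post-processing step might look with NLTK's pos_tag (the tagger, the tags it assigns to these particular words, and the merge rule are assumptions for illustration; CoreNLP's tags may differ):
import nltk  # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def merge_noun_adjective(text):
    """Join a noun immediately followed by an adjective (e.g. 'trust worthy') into one phrase."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    merged, i = [], 0
    while i < len(tagged):
        word, tag = tagged[i]
        # a NN* token directly followed by a JJ* token is treated as one phrase
        if tag.startswith("NN") and i + 1 < len(tagged) and tagged[i + 1][1].startswith("JJ"):
            merged.append(word + " " + tagged[i + 1][0])
            i += 2
        else:
            merged.append(word)
            i += 1
    return merged

print(merge_noun_adjective("This person is trust worthy."))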
To check similarity between phrases, you could use word embeddings such as GloVe. Some NLP libraries come with embeddings included, for example spaCy: https://spacy.io/usage/vectors-similarity
Note: spaCy uses cosine similarity at both the token level and the phrase level, and it also offers a convenience similarity function for larger phrases/sentences.
For example:
(using spaCy & Python)
import spacy

nlp = spacy.load("en_core_web_lg")  # a model that ships with word vectors

doc1 = nlp(u"The person is trustworthy.")
doc2 = nlp(u"The person is non judgemental.")
cosine_similarity = doc1.similarity(doc2)
The resulting cosine_similarity shows how similar the two phrases/words/sentences are; values closer to 1 mean more similar (for these vectors the score typically falls between 0 and 1).
Usually we start from:
nlp = spacy.load('en_core_web_sm') # or medium, or large
or
nlp = English()
then:
doc = nlp('my text')
Then we can do a lot of fun things with it, even without knowing what that first line actually does.
But what exactly is 'nlp'? What is going on under the hood? Is "nlp" a pretrained model, as understood in machine learning, and therefore some big file located somewhere on the disk?
I came across an explanation that 'nlp' is an 'object containing a processing pipeline', but that only explains a little.
You can always check the type of any Python object:
nlp = spacy.load('en_core_web_sm') # or medium, or large
print(type(nlp))
print(dir(nlp)) # view a list of attributes
You will get something like this (depending on the passed arguments):
<class 'spacy.lang.en.English'>
You are right, it is something like a 'pretrained' model, as it contains the vocabulary, binary weights, etc.
Please check the official documentation:
https://spacy.io/api/language
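To see a bit more of what is inside, you can also inspect the pipeline components and the model metadata. A small sketch (the exact component names depend on which model you loaded):
import spacy

nlp = spacy.load('en_core_web_sm')

# the processing pipeline: named components that run in order on each text
print(nlp.pipe_names)   # e.g. ['tok2vec', 'tagger', 'parser', 'ner', ...]

# metadata shipped with the model package (name, version, description, ...)
print(nlp.meta['name'], nlp.meta['version'])

# the shared vocabulary / string store
print(len(nlp.vocab))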
You could infer what nlp() is by exploring it. For example:
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_lg")
text = "Elon Musk 889-888-8888 elonpie#tessa.net Jeff Bezos (345)123-1234 bezzi#zonbi.com Reshma Saujani example.email#email.com 888-888-8888 Barkevious Mingo"
text = nlp(text)
print(text)
This will print the exact same text. On the other hand, if you do:
for word in text.ents:
    print(word.text, word.label_)
you will get the entities of the string:
Elon Musk PERSON
889-888 CARDINAL
Jeff Bezos PERSON
345)123 CARDINAL
Reshma Saujani PERSON
It is indeed a large pre-trained model for the English language with many components (tagger, parser, lemmatizer, named entity recognizer), the last of which is demonstrated above. Hope this helps to clarify your question a bit.
Can CoreNLP determine whether a common noun (as opposed to a proper noun or proper name) refers to a person out-of-the-box? Or if I need to train a model for this task, how do I go about that?
First, I am not looking for coreference resolution, but rather a building block for it. Coreference by definition depends on the context, whereas I am trying to evaluate whether a word in isolation is a subset of "person" or "human". For example:
is_human('effort') # False
is_human('dog') # False
is_human('engineer') # True
My naive attempt to use Gensim's and spaCy's pre-trained word vectors failed to rank "engineer" above the other two words.
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")
for word in ('effort', 'dog', 'engineer'):
    print(word, word_vectors.similarity(word, 'person'))
# effort 0.42303842
# dog 0.46886832
# engineer 0.32456854
I found the following lists from CoreNLP promising.
dcoref.demonym // The path for a file that includes a list of demonyms
dcoref.animate // The list of animate/inanimate mentions (Ji and Lin, 2009)
dcoref.inanimate
dcoref.male // The list of male/neutral/female mentions (Bergsma and Lin, 2006)
dcoref.neutral // Neutral means a mention that is usually referred by 'it'
dcoref.female
dcoref.plural // The list of plural/singular mentions (Bergsma and Lin, 2006)
dcoref.singular
Would these work for my task? And if so, how would I access them from the Python wrapper? Thank you.
I would suggest trying WordNet instead and see:
if enough of your terms are covered by WordNet and
if the terms you want are hyponyms of person.n.01.
You'd have to expand this a bit to cover multiple senses, but the gist would be:
from nltk.corpus import wordnet as wn
# True
wn.synset('person.n.01') in wn.synset('engineer.n.01').lowest_common_hypernyms(wn.synset('person.n.01'))
# False
wn.synset('person.n.01') in wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('person.n.01'))
See the NLTK docs for lowest_common_hypernyms: http://www.nltk.org/howto/wordnet_lch.html
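To handle senses, here is a minimal sketch of such an is_human check using the hypernym closure of the most frequent noun sense (checking every sense instead can backfire, since some words have figurative person senses):
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

PERSON = wn.synset('person.n.01')

def is_human(word):
    """True if the most frequent noun sense of `word` has person.n.01 among its hypernyms."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return False  # word not covered by WordNet
    # WordNet lists senses roughly by frequency; take the first one
    # closure() walks the hypernym chain all the way up to the root
    return PERSON in synsets[0].closure(lambda s: s.hypernyms())

print(is_human('effort'))    # False
print(is_human('dog'))       # False
print(is_human('engineer'))  # True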
The following is my code, where I take user input.
import en_core_web_sm
nlp = en_core_web_sm.load()
text = input("please enter your text or words here")
doc = nlp(text)
print([t.text for t in doc])
If the user inputs the text Deep Learning, the text is broken into
('Deep', 'Learning')
How do I add a whitespace exception in nlp, so that the output is like below?
(Deep Learning)
The raw text from the user input is split on whitespace characters, similar to text.split(' '). Then, the tokenizer processes the text from left to right. On each substring, it performs two checks:
Does the substring match a tokenizer exception rule? For example, "don't" does not contain whitespace, but should be split into two tokens, "do" and "n't", while "U.K." should always remain one token.
Can a prefix, suffix or infix be split off? For example punctuation like commas, periods, hyphens or quotes.
So if your user types in: Looking for Deep Learning experts
It will be tokenized as: ('Looking', 'for', 'Deep', 'Learning', 'experts')
spaCy does not know that Deep Learning is an entity in its own right. If you want spaCy to recognize Deep Learning as a single entity, you need to teach it. If you have a predefined list of words that you want spaCy to recognize as a single entity, you can use the PhraseMatcher to do that.
You can check the details on how to use PhraseMatcher here
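A minimal sketch of that approach with the current PhraseMatcher API might look like this (the term list and the "TECH_TERMS" label are just illustrative assumptions):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)

# hypothetical list of multi-word terms we want treated as single units
terms = ["Deep Learning", "Machine Learning"]
matcher.add("TECH_TERMS", [nlp.make_doc(term) for term in terms])

doc = nlp("Looking for Deep Learning experts")
spans = [doc[start:end] for _, start, end in matcher(doc)]

# merge each matched span into a single token
with doc.retokenize() as retokenizer:
    for span in spans:
        retokenizer.merge(span)

print([t.text for t in doc])  # ['Looking', 'for', 'Deep Learning', 'experts']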
UPDATE - Reply to OP's comment below
I do not think there is a way spaCy can know about the entity you are looking for without being trained in the context of your domain or being provided a predefined subset of the entities.
The only solution I can think of is to use an annotation tool to teach spaCy
- Take a subset of your user inputs and annotate them manually (you can use the prodigy tool by the makers of spaCy or Brat - it's free)
- Use the annotations to train a new or existing NER model. Details on training a model can be found in spaCy's training documentation.
Given a text like "Looking for Deep Learning experts", you would annotate "Deep Learning" with a label such as "FIELD". Then train a new entity type, 'FIELD'.
Once you have trained the model in the context, spaCy will learn to detect entities of interest.
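For reference, here is a rough sketch of that training loop using spaCy v2's API (the single training example, the 'FIELD' label and the iteration count are illustrative only; spaCy v3 replaces this with a config-driven workflow):
import random
import spacy

# one toy annotated example: "Deep Learning" spans characters 12-25
TRAIN_DATA = [
    ("Looking for Deep Learning experts", {"entities": [(12, 25, "FIELD")]}),
]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("FIELD")

optimizer = nlp.begin_training()
for itn in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer, losses=losses)

doc = nlp("We are hiring Deep Learning engineers")
print([(ent.text, ent.label_) for ent in doc.ents])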
I'm trying to identify user similarities by comparing the keywords used in their profile (from a website). For example, Alice = pizza, music, movies, Bob = cooking, guitar, movie and Eve = knitting, running, gym. Ideally, Alice and Bob are the most similar. I put down some simple code to calculate the similarity. To account for possible plural/singular version of the keywords I use something like:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
wnl = WordNetLemmatizer()
w1 = ["movies", "movie"]
tokens = [token.lower() for token in word_tokenize(" ".join(w1))]
lemmatized_words = [wnl.lemmatize(token) for token in tokens]
So that, lemmatized_words = ["movie", "movie"].
Afterwards, I do some pairwise keywords comparison using spacy, such as:
import spacy
nlp = spacy.load('en')
t1 = nlp(u"pizza")
t2 = nlp(u"food")
sim = t1.similarity(t2)
Now, the problem starts when I have to deal with compound terms such as: artificial intelligence, data science, whole food, etc. By tokenizing, I would simply split those terms into two words (e.g. artificial and intelligence), but this would affect my similarity measure. What is (or would be) the best approach to take these types of terms into account?
There are many ways to achieve this. One way would be to create the embeddings (vectors) yourself. This would have two advantages: first, you would be able to use bi-, tri-, and beyond (n-) grams as your tokens, and secondly, you are able to define the space that is best suited for your needs --- Wikipedia data is general, but, say, children's stories would be a more niche dataset (and more appropriate / "accurate" if you were solving problems to do with children and/or stories). There are several methods, of course word2vec being the most popular, and several packages to help you (e.g. gensim).
However, my guess is you would like something that's already out there. The best word embeddings right now are:
Numberbatch ('classic' best-in-class ensemble);
fastText, by Facebook Research (created at the character level --- some words that are out of vocabulary can be "understood" as a result);
sense2vec, by the same guys behind Spacy (created using parts-of-speech (POS) as additional information, with the objective to disambiguate).
The one we are interested in for a quick resolve of your problem is sense2vec. You should read the paper, but essentially these word embeddings were created using Reddit with additional POS information, and (thus) able to discriminate entities (e.g. nouns) that span multiple words. This blog post describes sense2vec very well. Here's some code to help you get started (taken from the prior links):
Install:
git clone https://github.com/explosion/sense2vec
pip install -r requirements.txt
pip install -e .
sputnik --name sense2vec --repository-url http://index.spacy.io install reddit_vectors
Example usage:
import sense2vec
model = sense2vec.load()
freq, query_vector = model["onion_rings|NOUN"]
freq2, query_vector2 = model["chicken_nuggets|NOUN"]
print(model.most_similar(query_vector, n=5)[0])
print(model.data.similarity(query_vector, query_vector2))
Important note, sense2vec requires spacy>=0.100,<0.101, meaning it will downgrade your current spacy install, not too much of a problem if you are only loading the en model. Also, here are the POS tags used:
ADJ ADP ADV AUX CONJ DET INTJ NOUN NUM PART PRON PROPN PUNCT SCONJ SYM VERB X
You could use spaCy for POS and dependency tagging, and then sense2vec to determine the similarity of the resulting entities. Or, depending on the size of your dataset (if it is not too large), you could grab n-grams in descending order of n and sequentially check whether each one is an entity in the sense2vec model.
Hope this helps!
There is also an approach using NLTK's MWETokenizer:
from nltk.tokenize import MWETokenizer
tokenizer = MWETokenizer([("artificial","intelligence"), ("data","science")], separator=' ')
tokens = tokenizer.tokenize("I am really interested in data science and artificial intelligence".split())
print(tokens)
The output is given as:
['I', 'am', 'really', 'interested', 'in', 'data science', 'and', 'artificial intelligence']
For more reference you can read here.
I have a document with tagged data in the format Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]. I live in a [PROP_TYPE condo] in [CITY New York]. I want to train a model based on a set of these type of tagged documents, and then use my model to tag new documents. Is this possible in NLTK? I have looked at chunking and NLTK-Trainer scripts, but these have a restricted set of tags and corpora, while my dataset has custom tags.
As @AleksandarSavkov already wrote, this is essentially a named entity recognition (NER) task -- or more generally a chunking task, as you already realize. How to do it is covered nicely in chapter 7 of the NLTK book. I recommend you ignore the sections on regexp tagging and use the approach in section 3, Developing and evaluating chunkers. It includes code samples you can use verbatim to create a chunker (the ConsecutiveNPChunkTagger). Your responsibility is to select features that will give you good performance.
You'll need to transform your data into the IOB format expected by NLTK's architecture; it expects part-of-speech tags, so the first step should be to run your input through a POS tagger. nltk.pos_tag() will do a decent enough job (once you strip off markup like [KEYWORD ...]) and requires no additional software to be installed. When your corpus is in the following format (word -- POS tag -- IOB tag), you are ready to train a recognizer:
Hi NNP O
here RB O
's POS O
my PRP$ O
phone NN B-KEYWORD
number NN I-KEYWORD
, , O
let VB O
me PRP O
...
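As a rough sketch of that preprocessing step (the bracket syntax is taken from the question; splitting on whitespace is a simplification, and a real pipeline would use nltk.word_tokenize to handle punctuation and contractions like "here's" properly):
import re
import nltk  # requires nltk.download('averaged_perceptron_tagger')

def to_iob(tagged_text):
    """Turn markup like '[KEYWORD phone number]' into (word, POS, IOB) triples."""
    tokens, iob_tags = [], []
    for match in re.finditer(r'\[(\w+) ([^\]]+)\]|(\S+)', tagged_text):
        if match.group(1):  # a bracketed [LABEL ...] chunk
            label, words = match.group(1), match.group(2).split()
            for i, word in enumerate(words):
                tokens.append(word)
                iob_tags.append(('B-' if i == 0 else 'I-') + label)
        else:               # an ordinary token
            tokens.append(match.group(3))
            iob_tags.append('O')
    pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return list(zip(tokens, pos_tags, iob_tags))

sample = "Hi here's my [KEYWORD phone number], let me know when you wanna hangout: [PHONE 7802708523]."
for word, pos, tag in to_iob(sample):
    print(word, pos, tag)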
The problem you are looking to solve is most commonly called Named Entity Recognition (NER). There are many algorithms that can help you solve the problem, but the most important notion is that you need to convert your text data into a suitable format for sequence taggers. Here is an example of the BIO format:
I O
love O
Paris B-LOC
and O
New B-LOC
York I-LOC
. O
From there, you can choose to train any type of classifier, such as Naive Bayes, SVM, MaxEnt, CRF, etc. Currently the most popular algorithm for such multi-token sequence classification tasks is CRF. There are available tools that will let you train a BIO model (although originally intended for chunking) from a file using the format shown above (e.g. YamCha, CRF++, CRFSuite, Wapiti). If you are using Python you can look into scikit-learn, python-crfsuite and PyStruct in addition to NLTK.
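For a concrete starting point, here is a minimal sketch using python-crfsuite (the feature set, the model file name, and the toy training data are illustrative assumptions; a real model needs many annotated sentences and richer features):
import pycrfsuite  # pip install python-crfsuite

def token_features(tokens, i):
    """Very small illustrative feature set for the token at position i."""
    word = tokens[i]
    return [
        'word.lower=' + word.lower(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'prev=' + (tokens[i - 1].lower() if i > 0 else '<BOS>'),
        'next=' + (tokens[i + 1].lower() if i < len(tokens) - 1 else '<EOS>'),
    ]

# one toy training sentence in BIO format, taken from the example above
tokens = ['I', 'love', 'Paris', 'and', 'New', 'York', '.']
labels = ['O', 'O', 'B-LOC', 'O', 'B-LOC', 'I-LOC', 'O']

trainer = pycrfsuite.Trainer(verbose=False)
trainer.append([token_features(tokens, i) for i in range(len(tokens))], labels)
trainer.train('toy-ner.crfsuite')

tagger = pycrfsuite.Tagger()
tagger.open('toy-ner.crfsuite')
test = ['I', 'love', 'New', 'York', '.']
print(list(zip(test, tagger.tag([token_features(test, i) for i in range(len(test))]))))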