Extracting sentences from a text document - nlp

I have a text document from which I'd like to extract the noun phrases. In the first step I extract sentences, then I do part-of-speech (POS) tagging for each sentence, and then I do chunking based on the POS tags. I use Stanford NLP for these tasks, and this is the code for extracting the sentences:
import edu.stanford.nlp.process.DocumentPreprocessor;

Reader reader = new StringReader(text);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
I think DocumentPreprocessor does POS tagging under the hood in order to extract the sentences. However, I'm also doing POS tagging to extract the noun phrases in the second phase. That is, POS tagging is done twice, and because it is a computationally expensive task, I'm looking for a way to do it only once. Is there any way to do POS tagging only once to extract both sentences and noun phrases?

No, DocumentPreprocessor does not run a tagger while it loads the text. (NB, it does have the capability to parse pre-tagged text, i.e. parse tokens in a file like dog_NN.)
In short: you aren't doing extra work, so I suppose that's good news!

I'm not sure. Try using nltk (a Python package):
import nltk
from nltk import word_tokenize

text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
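Since the question also needs noun phrases, here is a minimal sketch (assuming NLTK and a simple illustrative NP grammar of my own) that does sentence splitting, tagging and chunking in one pass, so POS tagging only happens once per sentence:
from nltk import sent_tokenize, word_tokenize, pos_tag, RegexpParser

text = "Cloud computing is benefiting major manufacturing companies. It is cheap."

# a simple NP grammar: optional determiner, any adjectives, one or more nouns
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

for sentence in sent_tokenize(text):           # sentence splitting
    tagged = pos_tag(word_tokenize(sentence))  # POS tagging, done once per sentence
    tree = chunker.parse(tagged)               # chunking reuses the same tags
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(" ".join(word for word, tag in subtree.leaves()))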

Related

Add rules to Spacy lemmatization

I am using spaCy lemmatization for preprocessing texts.
import spacy

nlp = spacy.load("en_core_web_sm")  # e.g.; any English model
doc = 'ups'
for i in nlp(doc):
    print(i.lemma_)
>> up
I understand why spaCy removes the 's', but it is important for me that in this case it doesn't. Is there a way to add specific rules to spaCy, or do I have to use if statements outside the process (which is something I don't want to do)?
In Spacy 3 the accepted solution throws an error:
KeyError: "[E159] Can't find table 'lemma_exc' in lookups. Available tables: ['lexeme_norm']"
As the lemmatizer is now a dedicated spaCy component, the lookups have to be modified directly on the component (this at least works for me):
nlp.get_pipe('lemmatizer').lookups.get_table("lemma_exc")["noun"]["data"] = ["data"]
Hope this is helpful to someone!
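Applied to the original "ups" example, a minimal sketch under the same assumptions (spaCy 3 with en_core_web_sm and its rule-based lemmatizer) would be:
import spacy

nlp = spacy.load("en_core_web_sm")
lemma_exc = nlp.get_pipe("lemmatizer").lookups.get_table("lemma_exc")

# register "ups" as its own lemma for both noun and verb readings
lemma_exc["noun"]["ups"] = ["ups"]
lemma_exc["verb"]["ups"] = ["ups"]

print(nlp("ups")[0].lemma_)  # expected: ups (the exact POS assigned to an isolated word may vary)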
For spacy v2:
Depending on whether you have a tagger, you can customize the rule-based lemmatizer exceptions or the lookup table:
import spacy
# original
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# may be "up" or "ups" depending on exact version of spacy/model because it
# depends on the POS tag
assert nlp("ups")[0].lemma_ in ("ups", "up")
# 1. Exception for rule-based lemmatizer (with tagger)
# reload to start with a clean lemma cache
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# add an exception for "ups" with POS NOUN or VERB
nlp.vocab.lookups.get_table("lemma_exc")["noun"]["ups"] = ["ups"]
nlp.vocab.lookups.get_table("lemma_exc")["verb"]["ups"] = ["ups"]
assert nlp("ups")[0].lemma_ == "ups"
# 2. New entry for lookup lemmatizer (without tagger)
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
nlp.vocab.lookups.get_table("lemma_lookup")["ups"] = "ups"
assert nlp("ups")[0].lemma_ == "ups"
If you are processing words in isolation, the tagger is not going to be very reliable (you might get NOUN, PROPN, or VERB for something like ups), so it might be easier to deal with customizing the lookup lemmatizer. The quality of the rule-based lemmas is better overall, but you need at least full phrases, preferably full sentences, to get reasonable results.

Spacy NLP: For proper nouns that can be verbs - Ambiguities according to input order and split based on punctuation

I am using spaCy NLP. No parser can always correctly determine the PROPN/NOUN/VERB status of an ambiguous token, since in most languages a word spelt the same way can have different meanings.
For example, "Encounter" can be an Encounter in the sense of aliens zapping you into their spaceship (noun, an occurrence), or "Encounter" as in "Encounter the world", i.e. come into contact with (verb).
spaCy sometimes identifies the same spelt word differently, even in similar situations.
Is it the punctuation (the "="?) that causes this?
I expected the token to be identified consistently as either verb or noun, not to change. I understand that the trained spaCy models (using en_small and en_medium) do not use an LSTM as they progress, so I should not expect spaCy to "establish continuity due to a previous decision in the same sentence", but I am still surprised that, given the same sentence format and identical content, spaCy identifies the token differently:
"Encounter the world. Encounter the self" and "Encounter the world=Encounter the self"
=> parses to VERB, NOUN respectively
"Encounter the self. Encounter the world."
=> parses to VERB, VERB.
"Encounter the self"
=> parses to VERB
Make sure that you are using an up-to-date version of spaCy and an up-to-date model such as en_core_web_lg.
On my setup, I do not get the behaviour you are describing:
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp("Encounter the world=Encounter the self.")
print([(t, t.pos_) for t in doc])
# [(Encounter, 'VERB'), (the, 'DET'), (world, 'NOUN'), (=, 'PUNCT'), (Encounter, 'VERB'), (the, 'DET'), (self, 'NOUN'), (., 'PUNCT')]
My version of spacy:
print(spacy.__version__)
# 2.2.1

How to use tokenized sentence as input for Spacy's PoS tagger?

spaCy's POS tagger is really convenient; it can tag a raw sentence directly.
import spacy
sp = spacy.load('en_core_web_sm')
sen = sp(u"I am eating")
But I'm using the tokenizer from NLTK. So how can I use a tokenized sentence like
['I', 'am', 'eating'] rather than 'I am eating' with spaCy's tagger?
BTW, where can I find detailed spaCy documentation?
I can only find an overview on the official website.
Thanks.
There are two options:
You write a wrapper around the nltk tokenizer and use it to convert text to spaCy's Doc format. Then overwrite nlp.tokenizer with that new custom function. More info here: https://spacy.io/usage/linguistic-features#custom-tokenizer.
Generate a Doc directly from a list of strings, like so:
from spacy.tokens import Doc

doc = Doc(nlp.vocab, words=[u"I", u"am", u"eating", u"."],
          spaces=[True, True, False, False])
Defining the spaces is optional - if you leave it out, each word will be followed by a space by default. This matters when using e.g. the doc.text afterwards. More information here: https://spacy.io/usage/linguistic-features#own-annotations
[edit]: note that nlp and doc are sort of 'standard' variable names in spaCy; they correspond to the variables sp and sen respectively in your code.
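For option 1, here is a minimal sketch of such a wrapper, assuming en_core_web_sm and NLTK's word_tokenize (the wrapper function itself is just an illustration, not part of spaCy):
import spacy
from spacy.tokens import Doc
from nltk import word_tokenize

nlp = spacy.load("en_core_web_sm")

def nltk_tokenizer(text):
    # build a spaCy Doc from NLTK's tokens; spaces default to True
    return Doc(nlp.vocab, words=word_tokenize(text))

nlp.tokenizer = nltk_tokenizer

doc = nlp("I am eating.")
print([(t.text, t.tag_) for t in doc])
The rest of the pipeline (tagger, parser, etc.) then runs on the pre-tokenized Doc as usual.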

Is there a bi gram or tri gram feature in Spacy?

The code below breaks the sentence into individual tokens, and the output is as below:
"cloud" "computing" "is" "benefiting" "major" "manufacturing" "companies"
import en_core_web_sm
nlp = en_core_web_sm.load()
doc = nlp("Cloud computing is benefiting major manufacturing companies")
for token in doc:
    print(token.text)
What I would ideally want is to read 'cloud computing' together, as it is technically one word.
Basically I am looking for a bigram. Is there any feature in Spacy that allows bigrams or trigrams?
spaCy allows the detection of noun chunks. So to parse your noun phrases as single entities, do this:
Detect the noun chunks
https://spacy.io/usage/linguistic-features#noun-chunks
Merge the noun chunks
Inspect the tokens again; "cloud computing" will now be treated as a single entity.
>>> import spacy
>>> nlp = spacy.load('en')
>>> doc = nlp("Cloud computing is benefiting major manufacturing companies")
>>> list(doc.noun_chunks)
[Cloud computing, major manufacturing companies]
>>> for noun_phrase in list(doc.noun_chunks):
...     noun_phrase.merge(noun_phrase.root.tag_, noun_phrase.root.lemma_, noun_phrase.root.ent_type_)
...
Cloud computing
major manufacturing companies
>>> [(token.text,token.pos_) for token in doc]
[('Cloud computing', 'NOUN'), ('is', 'VERB'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]
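Note that Span.merge is deprecated in more recent spaCy versions; an equivalent sketch using the retokenizer (assuming spaCy 2.1+ and the en_core_web_sm model) would be:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Cloud computing is benefiting major manufacturing companies")

# merge each noun chunk into a single token
with doc.retokenize() as retokenizer:
    for noun_phrase in doc.noun_chunks:
        retokenizer.merge(noun_phrase)

print([(token.text, token.pos_) for token in doc])
# e.g. [('Cloud computing', 'NOUN'), ('is', 'AUX'), ('benefiting', 'VERB'), ('major manufacturing companies', 'NOUN')]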
If you have a spacy doc, you can pass it to textacy:
import textacy.extract

ngrams = list(textacy.extract.basics.ngrams(doc, 2, min_freq=2))
Warning: this is just an extension of the correct answer made by Zuzana.
My reputation does not allow me to comment, so I am making this answer just to answer Adit Sanghvi's question above: "How do you do it when you have a list of documents?"
First you need to create a list with the text of the documents
Then you join the text lists into just one document
Now you use the spaCy parser to transform the text document into a spaCy document
You use Zuzana's answer to create the bigrams
This is the example code:
Step 1
doc1 = ['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code']
doc2 = ['how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy']
doc3 = ['i love to repeat phrases to make bigrams because i love make bigrams']
listOfDocuments = [doc1,doc2,doc3]
textList = [''.join(textList) for text in listOfDocuments for textList in text]
print(textList)
This will print this text:
['all what i want is that you give me back my code because i worked a lot on it. Just give me back my code', 'how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy', 'i love to repeat phrases to make bigrams because i love make bigrams']
Then steps 2 and 3:
import spacy

parser = spacy.load('en_core_web_sm')  # any spaCy English model works as the parser
doc = ' '.join(textList)
spacy_doc = parser(doc)
print(spacy_doc)
and will print this:
all what i want is that you give me back my code because i worked a lot on it. Just give me back my code how are you? i am just showing you an example of how to make bigrams on spacy. We love bigrams on spacy i love to repeat phrases to make bigrams because i love make bigrams
Finally step 4 (Zuzana's answer)
import textacy.extract

ngrams = list(textacy.extract.ngrams(spacy_doc, 2, min_freq=2))
print(ngrams)
will print this:
[make bigrams, make bigrams, make bigrams]
I had a similar problem (bigrams, trigrams, like your "cloud computing"). I made a simple list of the n-grams, word_3gram, word_2gram etc., with the gram as the basic unit (cloud_computing).
Assume I have the sentence "I like cloud computing because it's cheap". The sentence_2gram is: "I_like", "like_cloud", "cloud_computing", "computing_because" ... Comparing that against your bigram list, only "cloud_computing" is recognized as a valid bigram; all other bigrams in the sentence are artificial. To recover all the other words you just take the first part of each bigram:
"I_like".split("_")[0] -> I
"like_cloud".split("_")[0] -> like
"cloud_computing" -> in the bigram list, keep it
skip the next bigram "computing_because" ("computing" is already used)
"because_it's".split("_")[0] -> because, etc.
To also capture the last word in the sentence ("cheap") I added the token "EOL". I implemented this in Python, and the speed was OK (500k words in 3 min) on an i5 processor with 8 GB. Anyway, you only have to do it once. I find this more intuitive than the official (spaCy-style) chunk approach. It also works for non-spaCy frameworks.
I do this before the official tokenization/lemmatization, as otherwise you would get "cloud compute" as a possible bigram. But I'm not certain if this is the best/right approach.
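A rough sketch of that merging step (the bigram list, the EOL sentinel handling, and the helper name are illustrative assumptions, not the answerer's actual code) might look like this:
KNOWN_BIGRAMS = {"cloud_computing"}  # example bigram list

def merge_known_bigrams(tokens):
    tokens = tokens + ["EOL"]          # sentinel so the last word is kept
    merged = []
    i = 0
    while i < len(tokens) - 1:
        bigram = tokens[i] + "_" + tokens[i + 1]
        if bigram in KNOWN_BIGRAMS:
            merged.append(bigram)      # keep the known bigram as one unit
            i += 2                     # skip the word already used
        else:
            merged.append(tokens[i])   # keep only the first word of the artificial bigram
            i += 1
    return merged

print(merge_known_bigrams("I like cloud computing because it's cheap".split()))
# ['I', 'like', 'cloud_computing', 'because', "it's", 'cheap']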

How to POS_TAG a french sentence?

I'm looking for a way to POS-tag a French sentence, the way the following code is used for English sentences:
import nltk

def pos_tagging(sentence):
    var = sentence
    exampleArray = [var]
    for item in exampleArray:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
    return tagged
Here is the full source code; it works very well.
Download link for the Stanford NLP tagger: https://nlp.stanford.edu/software/tagger.shtml#About
from nltk.tag import StanfordPOSTagger
jar = 'C:/Users/m.ferhat/Desktop/stanford-postagger-full-2016-10-31/stanford-postagger-3.7.0.jar'
model = 'C:/Users/m.ferhat/Desktop/stanford-postagger-full-2016-10-31/models/french.tagger'
import os
java_path = "C:/Program Files/Java/jdk1.8.0_121/bin/java.exe"
os.environ['JAVAHOME'] = java_path
pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8' )
res = pos_tagger.tag('je suis libre'.split())
print (res)
The NLTK doesn't come with pre-built resources for French. I recommend using the Stanford tagger, which comes with a trained French model. This code shows how you might set up the nltk for use with Stanford's French POS tagger. Note that the code is outdated (and for Python 2), but you could use it as a starting point.
Alternately, the NLTK makes it very easy to train your own POS tagger on a tagged corpus, and save it for later use. If you have access to a (sufficiently large) French corpus, you can follow the instructions in the nltk book and simply use your corpus in place of the Brown corpus. You're unlikely to match the performance of the Stanford tagger (unless you can train a tagger for your specific domain), but you won't have to install anything.
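As a rough illustration of that second option, here is a minimal sketch, assuming you already have a French corpus as a list of tagged sentences (tagged_sentences and the "NOM" fallback tag are placeholders, not real NLTK resources):
import pickle
import nltk

# placeholder: each sentence is a list of (word, tag) pairs from your French corpus
tagged_sentences = [
    [("je", "PRO"), ("suis", "VER"), ("libre", "ADJ")],
    # ... many more sentences ...
]

# a simple backoff chain: bigram -> unigram -> default tag
default_tagger = nltk.DefaultTagger("NOM")
unigram_tagger = nltk.UnigramTagger(tagged_sentences, backoff=default_tagger)
bigram_tagger = nltk.BigramTagger(tagged_sentences, backoff=unigram_tagger)

# save the trained tagger for later use
with open("french_tagger.pickle", "wb") as f:
    pickle.dump(bigram_tagger, f)

print(bigram_tagger.tag("je suis libre".split()))
You can load the saved tagger back later with pickle.load instead of retraining it.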
