First experience with NLP here. I have about a half million tweets. I'm trying to use spacy to remove stop words, lemmatize, etc. and then pass the processed text to a classification model. Because of the size of the data I need multiprocessing to do this in reasonable speed, but can't figure out what to do with the generator object once I have it.
Here I load spacy and pass the data through the standard pipeline:
nlp = spacy.load('en')

tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']

spacy_tweets = []
for tweet in tweets:
    doc_tweet = nlp.pipe(tweet, batch_size=10, n_threads=3)
    spacy_tweets.append(doc_tweet)
Now I'd like to take the Doc objects spaCy creates and then process the text with something like this:
def spacy_tokenizer(tweet):
    tweet = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in tweet]
    tweet = [tok for tok in tweet if (tok not in stopwords and tok not in punctuations)]
    return tweet
But this doesn't work because spaCy returns generator objects when using the .pipe() method. So when I do this:
for tweet in spacy_tweets:
    print(tweet)
It prints the generator. Okay, I get that. But when I do this:
for tweet in spacy_tweets[0]:
    print(tweet)
I would expect it to print the Doc object or the text of the tweet in the generator, but it doesn't do that. Instead it prints each character out individually.
Am I approaching this wrong, or is there something I need to do to retrieve the Doc objects from the generators so I can use the spaCy attributes for lemmatizing, etc.?
I think you are using the nlp.pipe command incorrectly.
nlp.pipe is for parallelization, which means it processes several tweets simultaneously. So instead of passing a single tweet to nlp.pipe, you should pass the whole list of tweets.
The following code seems to achieve your goal:
import spacy

nlp = spacy.load('en')

tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']

spacy_tweets = nlp.pipe(tweets, batch_size=10, n_threads=3)

for tweet in spacy_tweets:
    for token in tweet:
        print(token.text, token.pos_)
Hope it helps!
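For completeness, here is a rough sketch of how the spacy_tokenizer from the question could then be applied to the Docs coming out of nlp.pipe. The stopwords and punctuations names are assumptions standing in for whatever lists you already use:

import string
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en')
stopwords = STOP_WORDS                  # assumed stop word list
punctuations = set(string.punctuation)  # assumed punctuation set

def spacy_tokenizer(doc):
    # lemmatize and lowercase; "-PRON-" is the spaCy 2.x pronoun lemma
    tokens = [tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_ for tok in doc]
    # drop stop words and punctuation
    return [tok for tok in tokens if tok not in stopwords and tok not in punctuations]

tweets = ['This is a dummy tweet for stack overflow',
          'What do we do with generator objects?']
processed_tweets = [spacy_tokenizer(doc) for doc in nlp.pipe(tweets, batch_size=10, n_threads=3)]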
I am experimenting with spaCy for information extraction and would like to return particular tokens, such as the object of a preposition (pobj) together with any compounds.
For the example below, I am trying to write code that will return 'radar swivel'.
So far I have tried:
# component/assy
import spacy

# load English language model
nlp = spacy.load('en_core_web_sm', disable=['ner', 'textcat'])

def component(text):
    doc = nlp(text)
    for token in doc:
        # extract object
        if token.dep_ == 'pobj':
            return token.text
        elif token.dep_ == 'compound':
            return token.text

df['Component'] = df['Text'].apply(lambda x: component(x))
df.head()
This returns the word 'swivel' but not the preceding compound 'radar'. Is there a way I can rewrite the code to detect the pobj and return it together with any associated compounds? Thanks!
The return statements break out of the loop: as soon as you reach the token that is the pobj, the function returns and moves on to the next text without ever checking the remaining tokens for compounds.
To fix that, you can use the following function. Once it finds a pobj, it looks at the token's children and keeps the ones that are compounds:
def component(text):
    doc = nlp(text)
    for token in doc:
        if token.dep_ == 'pobj':
            compounds = [child.text for child in token.children if child.dep_ == "compound"]
            yield " ".join(compounds) + " " + token.text

df['Component'] = df['Text'].apply(lambda x: list(component(x)))
df.head()
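As a quick sanity check, the generator can also be called on a single made-up string (the sentence below is only an illustration, not from your data):

# illustrative usage; the output should be something like ['radar swivel']
print(list(component('The cable was attached to the radar swivel')))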
I am looking for a way to singularize noun chunks with spacy
S='There are multiple sentences that should include several parts and also make clear that studying Natural language Processing is not difficult '
nlp = spacy.load('en_core_web_sm')
doc = nlp(S)

[chunk.text for chunk in doc.noun_chunks]
# = ['multiple sentences', 'several parts', 'Natural language Processing']
You can also get the "root" of the noun chunk:
[chunk.root.text for chunk in doc.noun_chunks]
# = ['sentences', 'parts', 'Processing']
I am looking for a way to singularize those roots of the chunks.
GOAL: Singularized: ['sentence', 'part', 'Processing']
Is there any obvious way to do this? Does it always depend on the POS of each root word?
Thanks
Note:
I found this: https://www.geeksforgeeks.org/nlp-singularizing-plural-nouns-and-swapping-infinite-phrases/
but that approach looks like it leads to many different methods, and of course different ones for every language. (I am working in EN, FR, DE.)
To get the base form of each word, you can use the .lemma_ property of the chunk root or of any token.
I am using spaCy version 2.x:
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
doc = nlp('did displaying words')
print(" ".join([token.lemma_ for token in doc]))
and the output:
do display word
Hope it helps :)
There is! You can take the lemma of the head word in each noun chunk.
[chunk.root.lemma_ for chunk in doc.noun_chunks]
# = ['sentence', 'part', 'processing']
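Since you mention working in EN, FR, and DE: the same chunk.root.lemma_ approach carries over to spaCy's other language models. A small sketch with the German model, assuming de_core_news_sm is installed:

import spacy

nlp_de = spacy.load('de_core_news_sm')
doc_de = nlp_de('Es gibt mehrere Sätze in diesem Dokument')
print([chunk.root.lemma_ for chunk in doc_de.noun_chunks])

Keep in mind that lemmatization quality varies by language in spaCy 2.x (German and French use lookup tables), so the results may be rougher than in English.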
I am using spaCy to process sentences of a doc. Given one sentence, I'd like to get the previous and following sentence.
I can easily iterate over the sentences of the doc as follows:
nlp_content = nlp(content)
sentences = nlp_content.sents
for idx, sent in enumerate(sentences):
But I can't get the sentence #idx-1 or #idx+1 from the sentence #idx.
Is there any function or property that could be useful there?
Thanks!
Nick
There isn't a built-in sentence index. You would need to iterate over the sentences once and collect them into your own list of sentence spans so you can access them by position:
sentence_spans = list(nlp_content.sents)  # alternately: tuple(nlp_content.sents)
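From there, getting the previous and following sentence is plain list indexing, for example:

for idx, sent in enumerate(sentence_spans):
    # None at the boundaries, otherwise the neighbouring sentence spans
    prev_sent = sentence_spans[idx - 1] if idx > 0 else None
    next_sent = sentence_spans[idx + 1] if idx + 1 < len(sentence_spans) else None
    print(prev_sent, '|', sent, '|', next_sent)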
The spaCy POS tagger is usually used on entire sentences. Is there a way to efficiently apply unigram POS tagging to a single word (or a list of single words)?
Something like this:
words = ["apple", "eat", good"]
tags = get_tags(words)
print(tags)
> ["NNP", "VB", "JJ"]
Thanks.
English unigrams are often hard to tag well, so think about why you want to do this and what you expect the output to be. (Why is the POS of apple in your example NNP? What's the POS of can?)
spacy isn't really intended for this kind of task, but if you want to use spacy, one efficient way to do it is:
import spacy

nlp = spacy.load('en')

# disable everything except the tagger
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "tagger"]
nlp.disable_pipes(*other_pipes)

# use nlp.pipe() instead of nlp() to process multiple texts more efficiently
for doc in nlp.pipe(words):
    if len(doc) > 0:
        print(doc[0].text, doc[0].tag_)
See the documentation for nlp.pipe(): https://spacy.io/api/language#pipe
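If you want something shaped like the get_tags helper from your example (get_tags is just a name from the question, not a spaCy function), a minimal self-contained sketch could look like this:

import spacy

nlp = spacy.load('en', disable=['parser', 'ner'])  # keep only the tagger

def get_tags(words):
    # each "text" is a single word; token.tag_ is the fine-grained tag
    return [doc[0].tag_ if len(doc) > 0 else None for doc in nlp.pipe(words)]

print(get_tags(["apple", "eat", "good"]))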
You can do something like this:
import spacy

nlp = spacy.load("en_core_web_sm")

word_list = ["apple", "eat", "good"]
for word in word_list:
    doc = nlp(word)
    print(doc[0].text, doc[0].pos_)
Alternatively, you can do:
import spacy

nlp = spacy.load("en_core_web_sm")

doc = spacy.tokens.doc.Doc(nlp.vocab, words=word_list)
for name, proc in nlp.pipeline:
    doc = proc(doc)

pos_tags = [x.pos_ for x in doc]
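If you then want the tags lined up with the input words, you can simply zip them back together:

# exact tags depend on the model; this just pairs each word with its predicted tag
print(dict(zip(word_list, pos_tags)))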
I have a corpus (hotel reviews) and I want to do some NLP processing, including TF-IDF. My problem is that when I apply TF-IDF and print 100 features, they don't appear as single words but as entire sentences.
Here is my code:
Note: clean_doc is a function that returns my corpus cleaned of stop words, with stemming, etc.
vectorizer = TfidfVectorizer(analyzer='word', tokenizer=clean_doc,
                             max_features=100, lowercase=False,
                             ngram_range=(1, 3), min_df=1)
vz = vectorizer.fit_transform(list(data['Review']))

feature_names = vectorizer.get_feature_names()
for feature in feature_names:
    print(feature)
It returns something like this:
love view good room
food amazing recommended
bad services location far
-----
Any idea why? Thanks in advance.
There is most likely an error in your clean_doc function. The tokenizer argument should be a function that takes a string as input and returns a list of tokens. If clean_doc returns each cleaned review as a single element instead of splitting it into words, the whole review is treated as one token, which is why entire sentences show up as features.
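For illustration, a tokenizer passed to TfidfVectorizer should look roughly like the sketch below. This is not your clean_doc (which I haven't seen), just an assumed spaCy-based version that returns a list of token strings per review:

import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

def clean_doc(text):
    # takes one review string and returns a list of token strings
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_stop and not tok.is_punct and not tok.is_space]

vectorizer = TfidfVectorizer(analyzer='word', tokenizer=clean_doc,
                             max_features=100, lowercase=False,
                             ngram_range=(1, 3), min_df=1)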