How to add stop words to TfidfVectorizer? - python-3.x

I am trying to add stop words to my stop word list; however, the code I am using doesn't seem to be working:
Creating stop words list:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords.extend(CustomListofWordstoExclude)
Here I am converting the text to a DTM (document-term matrix) with TF-IDF weighting:
vect = TfidfVectorizer(stop_words = 'english', min_df=150, token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape
But when I do this, I get this error:
FutureWarning: Pass input=None as keyword args. From version 0.25 passing these as positional arguments will result in an error
warnings.warn("Pass {} as keyword args. From version 0.25 "
What does this mean? Is there an easier way to add stopwords?

I'm unable to reproduce the warning. However, note that a warning such as this does not mean that your code did not run as intended. It means that in future releases of the package it may not work as intended. So if you try the same thing next year with updated packages, it may not work.
With respect to your question about using stop words, there are two changes that need to be made for your code to work as you expect.
list.extend() extends the list in place, but it doesn't return the list. To see this, you can check type(stopwords1), which gives NoneType. To define a new variable and add the custom word list to stopwords in one line, you can just use the built-in + operator for lists:
stopwords = nltk.corpus.stopwords.words('english')
CustomListofWordstoExclude = ['rt']
stopwords1 = stopwords + CustomListofWordstoExclude
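For comparison, here is a quick self-contained illustration of why the original extend() line produced nothing useful (a plain list stands in for the NLTK stop word list):
words = ['a', 'an', 'the']
result = words.extend(['rt'])
print(type(result))  # <class 'NoneType'> -- extend() returns None
print(words)         # ['a', 'an', 'the', 'rt'] -- the list itself was modified in place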
To actually use stopwords1 as your new stopwords list when performing the TF-IDF vectorization, you need to pass stop_words=stopwords1:
vect = TfidfVectorizer(stop_words=stopwords1,  # Passed stopwords1 here
                       min_df=150,
                       token_pattern=u'\\b[^\\d\\W]+\\b')
dtm = vect.fit_transform(df['tweets'])
dtm.shape
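As a quick sanity check (a minimal sketch; it assumes the vect and df['tweets'] objects from the snippet above), you can confirm that the custom word is treated as a stop word after fitting:
print('rt' in vect.get_stop_words())  # True -- 'rt' is in the effective stop word list
print('rt' in vect.vocabulary_)       # False -- so it never enters the learned vocabulary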

Related

How do I print exact sentence by filtering using regular expression in Python

I'm new to regular expression and got stuck with the code below.
import re
s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
output = re.findall((r'[A-Z][a-z]*'), s)[0]
output2 = re.findall(r'\b[^A-Z\s\d]+\b', s)
mixing = " ".join(str(x) for x in output2)
finalmix = output+" " + mixing
print(finalmix)
Here I'm trying to print "Consider the task in Figure 8.11, which are balanced in fig 99.2" from the given string s as a single sentence. So I joined the two outputs with a join statement at the end to get it as a sentence. But it's quite confusing now, since "Figure 8.11" and "fig 99.2" will not be printed: I have not written a regex for them, because I cannot work out what regex I should use and how to combine its result at the end.
It's probably because I'm using the wrong approach to extract the sentence from the string s. I'd be glad if anyone could help me fix the code or point me to an alternative approach, as this code looks clumsy.
This is the output I get:
Consider the task in . which are balanced in .
To capture all bulleted items, I would use:
import re
s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
items = re.findall(r'\d+\.(?!\d)(.*?)(?=\d+\.(?!\d)|$)', s, flags=re.DOTALL)
print(items)
This prints:
['Consider the task in Figure 8.11, which are balanced in fig 99.2']
Here is an explanation of the regex pattern:
\d+\. match a bulleted number
(?!\d) which is NOT followed by another number
(.*?) match and capture all content, across newlines, until hitting
(?=\d+\.(?!\d)|$) another number bullet OR the end of the input
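To see how the pattern generalizes, here is a small sketch run on a made-up string containing two bulleted items:
import re

s2 = "1. First item mentions fig 1.5 2. Second item ends here"
items = re.findall(r'\d+\.(?!\d)(.*?)(?=\d+\.(?!\d)|$)', s2, flags=re.DOTALL)
print(items)
# [' First item mentions fig 1.5 ', ' Second item ends here']
# (the captures keep the surrounding whitespace, which you can strip() if needed)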
@TimBiegeleisen's answer works, but it is somewhat verbose because using re.findall requires repeating the bullet-point pattern at the start and again in the lookahead at the end.
For the purpose of finding strings between repeating patterns (bullet points in this case), it may be simpler to use re.split instead. Slice the resulting list to discard the first item, since we don't need what comes before the first bullet point:
re.split(r'\d+\.(?!\d)\s*', s)[1:]
This returns:
['Consider the task in Figure 8.11, which are balanced in fig 99.2']
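To see why the [1:] slice is needed, here is the same call without it (using the question's string s):
import re

s = "5. Consider the task in Figure 8.11, which are balanced in fig 99.2"
print(re.split(r'\d+\.(?!\d)\s*', s))
# ['', 'Consider the task in Figure 8.11, which are balanced in fig 99.2']
# re.split() keeps the (empty) text before the first bullet, so [1:] drops it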

Custom NER Spacy throwing IndexError: list index out of range

I am trying to create a custom NER model using spaCy, but while training I am getting the following error:
gold = GoldParse(doc, entities=entity_offsets)
File "gold.pyx", line 565, in spacy.gold.GoldParse.init
IndexError: list index out of range
Any idea as to how I can fix this?
The most common resolution that came up after some googling was to trim leading and trailing whitespace from the entity spans in the training data. So I used the code below to trim them off, but it still was of no use.
import re

invalid_span_tokens = re.compile(r'\s')
cleaned_data = []
for text, annotations in data:
    entities = annotations['entities']
    valid_entities = []
    for start, end, label in entities:
        valid_start = start
        valid_end = end
        # move the start forward past any leading whitespace
        while valid_start < len(text) and invalid_span_tokens.match(text[valid_start]):
            valid_start += 1
        # move the end back past any trailing whitespace
        while valid_end > 1 and invalid_span_tokens.match(text[valid_end - 1]):
            valid_end -= 1
        valid_entities.append([valid_start, valid_end, label])
    cleaned_data.append([text, {'entities': valid_entities}])
Ah, so the words would be a keyword argument on the GoldParse object. This lets you specify the gold-standard tokenization, if it doesn't match spaCy's tokenization. Assuming your input looks like this:
text = 'helloworld'
words = ['hello', 'world']
tags = ['INTJ', 'NOUN']
You can do the following:
from spacy.tokens import Doc
from spacy.gold import GoldParse

doc = Doc(nlp.vocab, words=words)  # build the Doc from the gold-standard tokens
gold = GoldParse(doc, words=words, tags=tags)
nlp.update([doc], [gold])
Alternatively, you can also use the new "simple training style" and just pass in the text as a string, and the annotations as a dictionary:
nlp.update([text], [{'words': words, 'tags': tags}])
In general, we'd recommend using the simple style, as it removes one layer of abstraction, and lets you get rid of the Doc and GoldParse imports. But ultimately, the style you choose depends on your personal preference.
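For the original NER use case, a minimal sketch of the simple training style might look like this (assuming spaCy v2.x; the example text, entity offsets, and label below are made up for illustration):
import spacy

nlp = spacy.blank('en')
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('PRODUCT')

# (text, annotations) pairs; offsets must not start or end on whitespace
train_data = [
    ("I love my iPhone", {'entities': [(10, 16, 'PRODUCT')]}),
]

optimizer = nlp.begin_training()
for itn in range(10):
    for text, annotations in train_data:
        nlp.update([text], [annotations], sgd=optimizer)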

Spacy similarity warning : "Evaluating Doc.similarity based on empty vectors."

I'm trying to do data augmentation on an FAQ dataset. I replace words, specifically nouns, with their most similar words according to WordNet, checking the similarity with spaCy. I use nested for loops to go through my dataset.
import spacy
import nltk
from nltk.corpus import wordnet as wn
import pandas as pd
nlp = spacy.load('en_core_web_md')
nltk.download('wordnet')
questions = pd.read_csv("FAQ.csv")
list_questions = []
for question in questions.values:
    list_questions.append(nlp(question[0]))

for question in list_questions:
    for token in question:
        treshold = 0.5
        if token.pos_ == 'NOUN':
            wordnet_syn = wn.synsets(str(token), pos=wn.NOUN)
            for syn in wordnet_syn:
                for lemma in syn.lemmas():
                    similar_word = nlp(lemma.name())
                    if similar_word.similarity(token) != 1. and similar_word.similarity(token) > treshold:
                        good_word = similar_word
                        treshold = token.similarity(similar_word)
However, the following warning is printed several times and I don't understand why:
UserWarning: [W008] Evaluating Doc.similarity based on empty vectors.
It is my call to similar_word.similarity(token) that creates the problem, but I don't understand why.
The form of my list_questions is:
list_questions = [Do you have a paper or other written explanation to introduce your model's details?, Where is the BERT code come from?, How large is a sentence vector?]
I need to check not only token but also similar_word in the loop; for example, I still get the warning here:
tokens = nlp(u'dog cat unknownword')
similar_word = nlp(u'rabbit')
if(similar_word):
    for token in tokens:
        if (token):
            print(token.text, similar_word.similarity(token))
You get that error message when similar_word is not a valid spacy document. E.g. this is a minimal reproducible example:
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat')
#similar_word = nlp(u'rabbit')
similar_word = nlp(u'')
for token in tokens:
    print(token.text, similar_word.similarity(token))
If you change the '' to be 'rabbit' it works fine. (Cats are apparently just a fraction more similar to rabbits than dogs are!)
(UPDATE: As you point out, unknown words also trigger the warning; they will be valid spacy objects, but not have any word vector.)
So, one fix would be to check similar_word is valid, including having a valid word vector, before calling similarity():
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat')
similar_word = nlp(u'')
if(similar_word and similar_word.vector_norm):
    for token in tokens:
        if(token and token.vector_norm):
            print(token.text, similar_word.similarity(token))
Alternative Approach:
You could suppress that particular warning. It is W008. I believe setting the environment variable SPACY_WARNING_IGNORE=W008 before running your script would do it. (Not tested.)
(See source code)
By the way, similarity() might cause some CPU load, so its result is worth storing in a variable instead of being calculated three times as you currently do. (Some people might argue that is premature optimization, but I think it also makes the code more readable.)
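For instance, the inner comparison from the question could be rewritten along these lines (a sketch reusing the question's variable names):
sim = similar_word.similarity(token)  # compute the similarity once
if sim != 1. and sim > treshold:
    good_word = similar_word
    treshold = sim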
I have suppressed the W008 warning by setting the environment variable with this code in my run file.
import os
from flask import Flask

app = Flask(__name__)
app.config['SPACY_WARNING_IGNORE'] = "W008"
os.environ["SPACY_WARNING_IGNORE"] = "W008"

if __name__ == "__main__":
    app.run(host='0.0.0.0', port=5000)

How to extract the hidden layer features in H2ODeepLearningEstimator?

I found H2O has the function h2o.deepfeatures in R to pull the hidden layer features
https://www.rdocumentation.org/packages/h2o/versions/3.20.0.8/topics/h2o.deepfeatures
train_features <- h2o.deepfeatures(model_nn, train, layer=3)
But I didn't find any example in Python. Can anyone provide some sample code?
Most Python/R API functions are wrappers around REST calls. See http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/model/model_base.html#ModelBase.deepfeatures
So, to convert an R example to a Python one, move the model to be the this (i.e. call deepfeatures() as a method on the model object), and all the other arguments shuffle along. That is, the example from the manual becomes (with the dots in variable names changed to underscores):
prostate_hex = ...
prostate_dl = ...
prostate_deepfeatures_layer1 = prostate_dl.deepfeatures(prostate_hex, 1)
prostate_deepfeatures_layer2 = prostate_dl.deepfeatures(prostate_hex, 2)
Sometimes the function name will change slightly (e.g. h2o.importFile() vs. h2o.import_file()), so you need to hunt for it at http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/index.html
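For a runnable end-to-end sketch (the dataset URL, column names, and network shape below are just assumptions for illustration, based on H2O's public prostate demo data):
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init()

# Load a small demo dataset and mark the response column as categorical
prostate_hex = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
prostate_hex["CAPSULE"] = prostate_hex["CAPSULE"].asfactor()

# Train a small deep learning model
prostate_dl = H2ODeepLearningEstimator(hidden=[10, 10, 10], epochs=5)
prostate_dl.train(x=["AGE", "RACE", "PSA", "GLEASON"], y="CAPSULE", training_frame=prostate_hex)

# Pull hidden-layer activations, mirroring the R example above
prostate_deepfeatures_layer1 = prostate_dl.deepfeatures(prostate_hex, 1)
prostate_deepfeatures_layer2 = prostate_dl.deepfeatures(prostate_hex, 2)
print(prostate_deepfeatures_layer1.head())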

What is Natural Language Processing Doing Exactly in This Code?

I am new to natural language processing and I want to use it to write a news aggregator (in Node.js, in my case). Rather than just use a prepackaged framework, I want to learn the nuts and bolts, and I am starting with the NLP portion. I found one tutorial that has been the most helpful so far:
http://www.p-value.info/2012/12/howto-build-news-aggregator-in-100-loc.html
In it, the author gets the RSS feeds and loops through them looking for the elements (or fields) title and description. I know Python and understand the code. But what I don't understand is what NLP is doing here with title and description under the hood (besides scraping and tokenizing, which is apparent... and those tasks don't need NLP).
import feedparser
import nltk
corpus = []
titles=[]
ct = -1
for feed in feeds:
    d = feedparser.parse(feed)
    for e in d['entries']:
        words = nltk.wordpunct_tokenize(nltk.clean_html(e['description']))
        words.extend(nltk.wordpunct_tokenize(e['title']))
        lowerwords = [x.lower() for x in words if len(x) > 1]
        ct += 1
        print ct, "TITLE", e['title']
        corpus.append(lowerwords)
        titles.append(e['title'])
(Reading your question more carefully, maybe this was all already obvious to you, but it doesn't look like anything deeper or more interesting is going on.)
wordpunct_tokenize is set up here (last line) as
wordpunct_tokenize = WordPunctTokenizer().tokenize
WordPunctTokenizer is implemented by this code:
class WordPunctTokenizer(RegexpTokenizer):
    def __init__(self):
        RegexpTokenizer.__init__(self, r'\w+|[^\w\s]+')
The heart of this is just the regular expression r'\w+|[^\w\s]+', which defines what strings are considered to be tokens by this tokenizer. There are two options, separated by the |:
\w+, that is, one or more "word" characters (alphabetical or numeric)
[^\w\s]+, one or more characters that are neither "word" characters nor whitespace, so this matches any run of punctuation
Here is a reference for Python regular expressions.
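To see the effect of that pattern in practice, here is a quick illustration (the example sentence is made up):
from nltk.tokenize import wordpunct_tokenize

print(wordpunct_tokenize("Can't wait -- prices drop 10%!"))
# ['Can', "'", 't', 'wait', '--', 'prices', 'drop', '10', '%!']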
I have not dug into RegexpTokenizer, but I assume it is set up such that the tokenize function returns an iterator that searches a string for the first match of the regular expression, then the next, and so on.
