Include a keyword dictionary in document classification using PySpark - apache-spark

I am trying to perform document classification using PySpark.
I am using the following steps:
# Required imports
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# Tokenizer
tokenizer = Tokenizer().setInputCol("text").setOutputCol("words")

# Stop word removal (custom_stopword_list is my own list of extra stop words)
updated_stopwords_list = list(set(StopWordsRemover().getStopWords() + custom_stopword_list))
remover_custom = StopWordsRemover(inputCol="words", outputCol="filtered",
                                  stopWords=updated_stopwords_list)

# HashingTF
hashingTF = HashingTF().setNumFeatures(1000).setInputCol("filtered").setOutputCol("rawFeatures")

# IDF
idf = IDF().setInputCol("rawFeatures").setOutputCol("features").setMinDocFreq(0)

pipeline = Pipeline(stages=[tokenizer, remover_custom, hashingTF, idf])
And I am using it in a pipeline.
Now, after removing the stop words, I want to apply a keyword dictionary (data dictionary) so that only the words present in that dictionary are kept from the array (the output of the stop word remover is an array of words).
Can anyone please guide me on how to do this? I am reading the keyword dictionary from a CSV file.

If you're not required to use HashingTF, here is one option using CountVectorizer and forcing the vocabulary to be your keyword list:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
from pyspark.ml.feature import CountVectorizer

# Prepare the keyword list to go into CountVectorizer.
# You could also use Tokenizer here if your keywords are only single words.
str_to_arr_udf = udf(lambda s: [s], ArrayType(StringType()))

# Fit CountVectorizer to the keyword list so that vocabulary = keywords
# (assumes the CSV has a header with a "keyword" column)
keyword_df = spark.read.format("csv").option("header", True).load(csv_file)
keyword_df = keyword_df.withColumn("filtered", str_to_arr_udf("keyword"))
cv = CountVectorizer(inputCol="filtered", outputCol="filtered_only_keywords", binary=True)
cvm = cv.fit(keyword_df)

# Transform the actual dataframe with the fitted model
cvm.transform(df_output_from_stopwords)
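To sanity-check that the fitted vocabulary really is your keyword list, you can inspect the fitted model:
# CountVectorizerModel exposes the learned vocabulary as a list of terms
print(cvm.vocabulary)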
Otherwise the udf route is always an option. Something like:
keyword_list = [x.word for x in spark.read.load(file).collect()]
keep_words_udf = udf(lambda word_list: [ word for word in word_list if word in keyword_list], ArrayType(StringType()) )
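Applied to the output of the stop word remover (the "filtered" column from the question), it would look roughly like this:
df_keywords_only = df_output_from_stopwords.withColumn(
    "filtered_only_keywords", keep_words_udf("filtered")
)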
Assuming this keyword list does not contain any words that are in the StopWordsRemover list, the StopWordsRemover step is actually unnecessary.

Related

How to calculate TF-IDF values of noun documents excluding spaCy stop words?

I have a data frame df with text, cleaned_text, and nouns as column names. text and cleaned_text contain string documents, and nouns is a list of nouns extracted from the cleaned_text column. df.shape = (1927, 3).
I am trying to calculate TF-IDF values for all documents within df only for nouns, excluding spaCy stopwords.
What I have tried?
import spacy
from spacy.lang.en import English

nlp = spacy.load('en_core_web_sm')

# Subclass to modify stop word lists, as recommended from spaCy version 3.0 onwards
excluded_stop_words = {'down'}
included_stop_words = {'dear', 'regards'}

class CustomEnglishDefaults(English.Defaults):
    stop_words = English.Defaults.stop_words.copy()
    stop_words -= excluded_stop_words
    stop_words |= included_stop_words

class CustomEnglish(English):
    Defaults = CustomEnglishDefaults

# Function to extract nouns from the cleaned_text column, excluding spaCy stopwords
nlp = CustomEnglish()

def nouns(text):
    doc = nlp(text)
    return [t for t in doc if t.pos_ in ['NOUN'] and not t.is_stop and not t.is_punct]

# Calculate TF-IDF values for nouns, excluding spaCy stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

documents = df.cleaned_text
tfidf = TfidfVectorizer(stop_words=CustomEnglish)
X = tfidf.fit_transform(documents)
What I am expecting?
I am expecting an output as a list of tuples ranked in descending order:
nouns = [('noun_1', tf-idf_1), ('noun_2', tf-idf_2), ...]. All nouns in nouns should match those in df.nouns (this is to check whether I am on the right track).
What is my issue?
I got confused about how to apply TfidfVectorizer so that it calculates TF-IDF values only for the nouns extracted from cleaned_text. I am also not sure whether scikit-learn's TfidfVectorizer can calculate TF-IDF the way I am expecting.
Not sure if you're still looking for a solution. Here is an option that you might want to go ahead with.
First of all, by default TF-IDF takes into account the entire set of words, not just nouns. Hence, you would need to implement a custom TF-IDF function to apply the results only to nouns. The following is a good reference on how TF-IDF works internally: https://www.askpython.com/python/examples/tf-idf-model-from-scratch
Instead of running the tf_idf function (as applied in the above URL) on all words of a sentence/document, you can just run it on the list of nouns you've extracted, i.e., change the code from:
def tf_idf(sentence):
    tf_idf_vec = np.zeros((len(word_set),))
    for word in sentence:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        tf_idf_vec[index_dict[word]] = value
    return tf_idf_vec
to:
def tf_idf(sentence, nouns):
    values = []
    for word in nouns:
        tf = termfreq(sentence, word)
        idf = inverse_doc_freq(word)
        value = tf * idf
        values.append(value)
    return values
You now have a "values" list corresponding to the list of "nouns" for each sentence. Hope this makes sense.
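For completeness, here is a minimal, self-contained sketch of the helper functions referenced above, loosely following the approach from the linked article; the toy corpus, the exact smoothing, and the variable names are assumptions rather than the article's exact code:
import numpy as np

# Placeholder corpus of pre-tokenized sentences
sentences = [['cat', 'sat', 'mat'], ['dog', 'ate', 'food'], ['cat', 'ate', 'mat']]
total_documents = len(sentences)

# Document frequency for every word in the corpus
word_set = sorted({w for s in sentences for w in s})
doc_freq = {w: sum(1 for s in sentences if w in s) for w in word_set}

def termfreq(sentence, word):
    # Relative frequency of `word` within a single sentence
    return sentence.count(word) / len(sentence)

def inverse_doc_freq(word):
    # Smoothed inverse document frequency
    return np.log(total_documents / (doc_freq.get(word, 0) + 1))

def tf_idf(sentence, nouns):
    # TF-IDF computed only for the supplied nouns, as in the answer above
    return [termfreq(sentence, word) * inverse_doc_freq(word) for word in nouns]

print(tf_idf(sentences[0], ['cat', 'mat']))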

Inefficient tokenization leading to better results

I am following the code from here.
I have a CSV file of 8000 questions and answers, and I have made an LSI model with 1000 topics from a tf-idf corpus using gensim, as follows. I only consider the questions as part of the text, not the answers.
import numpy as np
import jieba
from gensim import corpora, models, similarities
# from nltk.tokenize import WhitespaceTokenizer

texts = [jieba.lcut(text) for text in document]
# tk = WhitespaceTokenizer()
# texts = [tk.tokenize(text) for text in document]
dictionary = corpora.Dictionary(texts)
feature_cnt = len(dictionary.token2id)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
tfidf_corpus = tfidf[corpus]
lsi_model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=1000)
corpus_lsi = lsi_model[tfidf_corpus]
index = similarities.SparseMatrixSimilarity(corpus_lsi, num_features=feature_cnt)
Before this, I also preprocess the data by removing stopwords using nltk, replacing punctuation using a regex, and lemmatizing using WordNet and nltk.
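(For reference, a minimal sketch of such a preprocessing step; the exact regex, stop word list, and input variable used in the question are not shown, so the details here are assumptions:)
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# nltk.download('stopwords'); nltk.download('wordnet')  # one-time downloads
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r'[^\w\s]', ' ', text.lower())                # replace punctuation with spaces
    tokens = [t for t in text.split() if t not in stop_words]   # remove nltk stopwords
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)    # lemmatize with WordNet

document = [preprocess(text) for text in raw_questions]  # raw_questions is a hypothetical input list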
I understand that jieba is not a tokenizer suited for English because it emits the spaces as tokens as well, like this:
Sample: This is untokenized text
Tokenized: 'This',' ','is',' ','untokenized', ' ', 'text'
When I switch from jieba to the nltk whitespace tokenizer, a strange thing happens: my accuracy suddenly drops. That is, when I query a new sentence using the following code, I get worse results:
keyword = "New sentence the similarity of which is to be found to the main corpus"
kw_vector = dictionary.doc2bow(jieba.lcut(keyword)) # jieba.lcut can be replaced by tk.tokenize()
sim = index[lsi_model[tfidf[kw_vector]]]
x = [sim[i] for i in np.argsort(sim)[-2:]]
My understanding is that extra, useless tokens such as whitespace should decrease accuracy, but here I observe the opposite effect. What could be the possible reasons?
One possible explanation I came up with is that most of the questions are short, only 5 to 6 words, like:
What is the office address?
Who to contact for X?
Where to find document Y?

How to use tokens with sklearn in LDA

I have a list of tokenized documents containing both unigrams and bigrams, and I would like to perform sklearn LDA on it. I have tried the following code:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
my_data = [['low-rank matrix', 'detection method', 'problem finding'],
           ['probabilistic inference', 'problem finding', 'statistical learning', 'solution'],
           ['detection method', 'probabilistic inference', 'population', 'language'], ...]
tf_vectorizer = CountVectorizer(min_df=2, max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(my_data)
lda = LatentDirichletAllocation(n_topics=3, max_iter=5, random_state=10)
But when I print the output, I get something like this:
topic 0:
detection, finding, solution, method, problem
topic 1:
language, statistical, problem, learning, finding
and so on...
The bigrams are broken up and separated from one another. I have 10,000 documents and have already tokenized them; the method for finding the bigrams is not nltk-based, so that part is already done.
Is there any method to improve this without changing the input?
I am very new to sklearn, so apologies in advance if I am making some obvious mistake.
CountVectorizer has an ngram_range parameter which decides whether the vocabulary will contain unigrams, bigrams, trigrams, etc.:
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
For example:
ngram_range=(1,1) => Will include only unigrams
ngram_range=(1,2) => Will include unigrams and bigrams
ngram_range=(2,2) => Will include only bigrams
and so on...
You have not set it, so the default ngram_range=(1,1) is used, and hence only unigrams end up in the vocabulary here.
tf_vectorizer = CountVectorizer(min_df=2,
                                max_features=n_features,
                                stop_words='english',
                                ngram_range=(2, 2))  # You need this
tf = tf_vectorizer.fit_transform(my_data)
Secondly, you say that you have already tokenized the data and show a list of lists (my_data) in your code. That doesn't work with CountVectorizer. It expects a simple list of strings and will apply tokenization to them automatically, so you would need to plug your own preprocessing steps into it. See the other params 'preprocessor', 'tokenizer' and 'analyzer' in the linked documentation.
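If you need to keep your pre-tokenized lists as they are (so the multi-word tokens survive), one possibility based on the 'analyzer' parameter mentioned above is to pass a callable that returns each document's tokens unchanged. This is a minimal sketch, not the only way to do it:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

my_data = [['low-rank matrix', 'detection method', 'problem finding'],
           ['probabilistic inference', 'problem finding', 'statistical learning', 'solution'],
           ['detection method', 'probabilistic inference', 'population', 'language']]

# The callable analyzer receives each document as-is, so no re-tokenization happens
tf_vectorizer = CountVectorizer(min_df=2, analyzer=lambda doc: doc)
tf = tf_vectorizer.fit_transform(my_data)
# multi-word tokens remain intact (get_feature_names_out() in newer sklearn versions)
print(tf_vectorizer.get_feature_names())

# n_topics in older sklearn versions, n_components in newer ones
lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=10)
lda.fit(tf)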

What is the use of 'max_features' in TfidfVectorizer

What I have understood from it is: if max_features = n, it selects the top n features on the basis of their tf-idf values. I went through the documentation of TfidfVectorizer on scikit-learn but didn't understand it properly.
If you want row-wise words which have the highest tfidf values, then you need to access the transformed tf-idf matrix from Vectorizer, access it row by row (doc by doc) and then sort the values to get those.
Something like this:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# text_data is your collection of documents; tfidf_vectorizer is a TfidfVectorizer instance.
# TfidfVectorizer will by default output a sparse matrix
tfidf_data = tfidf_vectorizer.fit_transform(text_data).tocsr()
vocab = np.array(tfidf_vectorizer.get_feature_names())

# Replace this with the number of top words you want to get in each row
top_n_words = 5

# Loop over all the docs present
for i in range(tfidf_data.shape[0]):
    doc = tfidf_data.getrow(i).toarray().ravel()
    sorted_index = np.argsort(doc)[::-1][:top_n_words]
    print(sorted_index)
    for word, tfidf in zip(vocab[sorted_index], doc[sorted_index]):
        print("%s - %f" % (word, tfidf))
If you can use pandas, then the logic becomes simpler:
import pandas as pd

for i in range(tfidf_data.shape[0]):
    doc_data = pd.DataFrame({'Tfidf': tfidf_data.getrow(i).toarray().ravel(),
                             'Word': vocab})
    doc_data.sort_values(by='Tfidf', ascending=False, inplace=True)
    print(doc_data.iloc[:top_n_words])
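For a fully runnable version, the two snippets above assume that text_data and tfidf_vectorizer already exist; a minimal setup might look like this (toy data, purely for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = ['the cat sat on the mat',
             'the dog ate my homework',
             'the cat and the dog became friends']
tfidf_vectorizer = TfidfVectorizer()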

How to use the Scikit learn CountVectorizer?

I have a set of words for which I have to check whether they are present in the documents.
WordList = [w1, w2, ..., wn]
Another set has a list of documents on which I have to check whether these words are present or not.
How do I use scikit-learn's CountVectorizer so that the features of the term-document matrix are only the words from WordList, and each row represents a particular document, with the number of times a word from the given list appears in its respective column?
Ok. I get it.
The code is given below:
from sklearn.feature_extraction.text import CountVectorizer

# Counting the number of times each word (unigram) appears in a document
vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))

# First set the vocabulary by fitting on the word list
vectorizer = vectorizer.fit(WordList)

# Now transform the text contained in each document (Document_List is a list of document strings)
tfMatrix = vectorizer.transform(Document_List).toarray()
This will output the term-document matrix with features from WordList only.
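For example, with a small hypothetical WordList and document list (note that CountVectorizer lowercases by default, so keep WordList lowercase), this could look like:
from sklearn.feature_extraction.text import CountVectorizer

WordList = ['cat', 'dog', 'garden']
Document_List = ['The cat sat in the garden.', 'It likes the dog and the cat.']

vectorizer = CountVectorizer(input='content', binary=False, ngram_range=(1, 1))
vectorizer = vectorizer.fit(WordList)   # vocabulary is now exactly WordList
tfMatrix = vectorizer.transform(Document_List).toarray()

print(vectorizer.get_feature_names())   # ['cat', 'dog', 'garden']
print(tfMatrix)                         # rows are documents: [[1 0 1], [1 1 0]]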
For custom documents, you can use the CountVectorizer approach:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # make an object of CountVectorizer
corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]
X = vectorizer.fit_transform(corpus)
# print(X) to see the count given to each word

vectorizer.get_feature_names()
# ['black', 'cat', 'color', 'does', 'dog', 'garden', 'in', 'is', 'it',
#  'like', 'likes', 'not', 'roam', 'the', 'this', 'to']

X.toarray()  # used to convert X into a numpy array

vectorizer.transform(['A new cat.']).toarray()  # checking it for a new document
Other vectorizers can also be used, like TfidfVectorizer. The tf-idf vectorizer is a better approach, as it not only provides the number of occurrences of words in a particular document but also indicates the importance of each word.
It is calculated by finding TF (term frequency) and IDF (inverse document frequency).
Term frequency is the number of times a word appears in a particular document, and IDF is based on how many documents in the corpus contain that word.
For example, if the documents are related to football, then the word "the" would not give any insight, but the word "messi" would tell about the context of the document.
IDF is calculated by taking the logarithm of (total number of documents / number of documents containing the word). E.g., with 10 documents, where "the" appears in every document and "messi" appears in only 3 of them:
tf("the") = 10
tf("messi") = 5
idf("the") = log(10/10) = 0
idf("messi") = log(10/3) ≈ 0.52
tfidf("the") = tf("the") * idf("the") = 10 * 0 = 0
tfidf("messi") = tf("messi") * idf("messi") = 5 * 0.52 = 2.6
These weights help the algorithm identify the important words in the documents, which later helps it derive semantics from them.
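As a quick sketch, TfidfVectorizer can be dropped in place of CountVectorizer on the same corpus; the resulting matrix contains tf-idf weights rather than raw counts:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is a cat.',
    'It likes to roam in the garden',
    'It is black in color',
    'The cat does not like the dog.',
]

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print(tfidf_vectorizer.get_feature_names())  # same vocabulary as with CountVectorizer
print(X_tfidf.toarray())                     # tf-idf weights instead of raw counts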
