Below is the code
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
for w in Wrd_Freq:
    print(ps.stem(w))
Output
read
peopl
say
work
I need the output as
['read',
'people',
'say',
'work']
Full Code without Porter Stemmer
import nltk
import pandas as pd

lower = []
for item in df_text['job_description']:
    lower.append(item.lower())  # lowercase each job description
tokens = []
type(tokens)
token_string = [str(i) for i in lower]
string = "".join(token_string)
string = string.replace("-", "")
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\W+", gaps=True)
tokens = tokenizer.tokenize(string)
tokens
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
tokens = [token for token in tokens if token not in stopwords_list]
tokens
freq6000 = []
Wrd_Freq = nltk.FreqDist(tokens)
Wrd_Freq
df_WrdFreq = pd.DataFrame.from_dict(Wrd_Freq, orient='index')
df_WrdFreq.columns = ['Word Frequency']
freq6000 = df_WrdFreq[df_WrdFreq['Word Frequency'] >= 6000]  # keep words appearing at least 6000 times
freq6000.sort_values(by=['Word Frequency'], ascending=False).head(10)
I need to use the Porter stemmer separately to check whether there is any change to the count list. I need to perform the same steps after including the Porter stemmer and compare the output.
Use a list comprehension:
L= [ps.stem(w) for w in Wrd_Freq]
EDIT:
If you need the top values by counts:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
tokens = [token for token in tokens if token not in stopwords_list]
tokens
freq6000 = []
Wrd_Freq = nltk.FreqDist(tokens)
from collections import Counter
c = Counter(tokens)
top = [x for x, y in c.most_common(10)]
print (top)
['data', 'experience', 'business', 'work', 'science',
'learning', 'analytics', 'team', 'analysis', 'machine']
df_WrdFreq = pd.DataFrame.from_dict(Wrd_Freq, orient='index')
df_WrdFreq.columns = ['Word Frequency']
freq6000 = df_WrdFreq[df_WrdFreq['Word Frequency'] >= 6000]
df = freq6000.sort_values(by=['Word Frequency'], ascending=False).head(10)
print (df)
Word Frequency
data 124289
experience 59135
business 33528
work 28146
science 26864
learning 26850
analytics 21828
team 20825
analysis 20607
machine 20484
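Not part of the answer above, but to compare the counts with and without stemming, one option is a minimal sketch reusing tokens and df_WrdFreq from the question's code, rebuilding the frequency table on stemmed tokens:
import nltk
import pandas as pd
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_tokens = [ps.stem(token) for token in tokens]   # stem every token first

Wrd_Freq_stem = nltk.FreqDist(stemmed_tokens)           # frequencies of the stemmed forms
df_stem = pd.DataFrame.from_dict(Wrd_Freq_stem, orient='index')
df_stem.columns = ['Word Frequency (stemmed)']

# top 10 before and after stemming, printed side by side for comparison
print(df_WrdFreq.sort_values('Word Frequency', ascending=False).head(10))
print(df_stem.sort_values('Word Frequency (stemmed)', ascending=False).head(10))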
Related
I am extracting information about chemical elements from Wikipedia. It contains sentences, and I want each sentence to be added as follows:
Molecule   Sentence1     Sentence1 and sentence2    All_sentence
MgO        this is s1.   this is s1. this is s2.    all_sentence
CaO        this is s1.   this is s1. this is s2.    all_sentence
What I've achieved so far
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
text_sentences = nlp(sumary)
sent_list = []
for sentence in text_sentences.sents:
    sent_list.append(sentence.text)
#print(sent_list)
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})
print(df.head())
The output looks like:
Molecule   Description
MgO        All sentences are here
MgO        ...
The Molecule column is repeated for each sentence, which is not what I want.
Please suggest a solution.
It's not clear why you would want to repeat all sentences in each column, but you can get to the form you want with pivot:
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
sent_list = [sent.text for sent in nlp(sumary).sents]
#cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list)+1)] # uncomment to cumulate sentences in columns
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})  # replace sent_list with cumul_sent_list if cumulating
df["Sentences"] = pd.Series([f"Sentence{i + 1}" for i in range(len(df))]) # replace "Sentence{i+1}" with "Sentence1-{i+1}" if cumulating
df = df.pivot(index="Molecule", columns="Sentences", values="Description")
print(df)
sent_list can be created using a list comprehension. Create cumul_sent_list if you want your sentences to be repeated in columns.
Output:
Sentences Sentence1 ... Sentence9
Molecule ...
MgO Magnesium oxide (MgO), or magnesia, is a white... ... According to evolutionary crystal structure pr...
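As an aside (not part of the original answer), here is a tiny self-contained example with made-up data showing what pivot does to the long Molecule/Sentences/Description frame:
import pandas as pd

# toy data standing in for the spaCy sentences; column names match the answer above
long_df = pd.DataFrame({
    'Molecule': ['MgO', 'MgO', 'CaO', 'CaO'],
    'Sentences': ['Sentence1', 'Sentence2', 'Sentence1', 'Sentence2'],
    'Description': ['this is s1.', 'this is s2.', 'this is s1.', 'this is s2.'],
})

wide = long_df.pivot(index='Molecule', columns='Sentences', values='Description')
print(wide)
# Sentences    Sentence1    Sentence2
# Molecule
# CaO        this is s1.  this is s2.
# MgO        this is s1.  this is s2.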
I am writing a method that returns the cosine similarity between two documents, using sklearn's CountVectorizer().
I have tried
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
def doc_cos_similar(doc1: str, doc2: str) -> float:
    vectorizer = CountVectorizer()
    doc1 = "Good morning"
    doc2 = "Good evening"
    documents = [doc1, doc2]
    count_vectorizer = CountVectorizer()
    sparse_matrix = count_vectorizer.fit_transform(documents)
    doc_term_matrix = sparse_matrix.todense()
    return doc_term_matrix
#input
doc1="Good morning"
doc2="Good afternoon"
The output should be something like 0.60.
But the output is a
matrix([[0, 1, 1],
[1, 1, 0]])
You're almost there.
cosine_similarity(doc_term_matrix) returns
array([[1. , 0.5],
[0.5, 1. ]])
So you can use cosine_similarity(doc_term_matrix)[0][1] (or [1][0], it doesn't matter since cosine is symmetric).
P.S. you should pass doc1 and doc2 as arguments, instead of hard-coding them.
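Putting both points together, a corrected version of the method might look like this (a sketch; it keeps the question's function name and signature):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def doc_cos_similar(doc1: str, doc2: str) -> float:
    # build the term-count matrix for the two documents
    count_vectorizer = CountVectorizer()
    doc_term_matrix = count_vectorizer.fit_transform([doc1, doc2])
    # cosine_similarity returns a 2x2 matrix; the off-diagonal entry
    # is the similarity between the two documents
    return cosine_similarity(doc_term_matrix)[0][1]

print(doc_cos_similar("Good morning", "Good evening"))  # 0.5, matching the matrix above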
You can try this:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# X = input("Enter first string: ").lower()
# Y = input("Enter second string: ").lower()
X ="Good morning! Welcome"
Y ="Good evening! Welcome"
# tokenization
X_list = word_tokenize(X)
Y_list = word_tokenize(Y)
# sw contains the list of stopwords
sw = stopwords.words('english')
l1 =[];l2 =[]
# remove stop words from the string
X_set = {w for w in X_list if not w in sw}
Y_set = {w for w in Y_list if not w in sw}
# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    if w in X_set:
        l1.append(1)  # create a vector
    else:
        l1.append(0)
    if w in Y_set:
        l2.append(1)
    else:
        l2.append(0)
c = 0
# cosine formula
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)
I am trying to understand the math behind the TfidfVectorizer. I used this tutorial, but my code is a little bit changed:
The tutorial also says at the end that "the values differ slightly because sklearn uses a smoothed version of idf and various other little optimizations".
I want to be able to use TfidfVectorizer but also calculate the same simple example by hand.
Here is my whole code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer


def main():
    documentA = 'the man went out for a walk'
    documentB = 'the children sat around the fire'
    corpus = [documentA, documentB]
    bagOfWordsA = documentA.split(' ')
    bagOfWordsB = documentB.split(' ')
    uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))

    print('----------- compare word count -------------------')
    numOfWordsA = dict.fromkeys(uniqueWords, 0)
    for word in bagOfWordsA:
        numOfWordsA[word] += 1
    numOfWordsB = dict.fromkeys(uniqueWords, 0)
    for word in bagOfWordsB:
        numOfWordsB[word] += 1

    tfA = computeTF(numOfWordsA, bagOfWordsA)
    tfB = computeTF(numOfWordsB, bagOfWordsB)
    print(pd.DataFrame([tfA, tfB]))

    CV = CountVectorizer(stop_words=None, token_pattern='(?u)\\b\\w\\w*\\b')
    cv_ft = CV.fit_transform(corpus)

    tt = TfidfTransformer(use_idf=False, norm='l1')
    t = tt.fit_transform(cv_ft)
    print(pd.DataFrame(t.todense().tolist(), columns=CV.get_feature_names()))

    print('----------- compare idf -------------------')
    idfs = computeIDF([numOfWordsA, numOfWordsB])
    print(pd.DataFrame([idfs]))

    tfidfA = computeTFIDF(tfA, idfs)
    tfidfB = computeTFIDF(tfB, idfs)
    print(pd.DataFrame([tfidfA, tfidfB]))

    ttf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
    f = ttf.fit_transform(cv_ft)
    print(pd.DataFrame(f.todense().tolist(), columns=CV.get_feature_names()))

    print('----------- TfidfVectorizer -------------------')
    vectorizer = TfidfVectorizer(smooth_idf=False, use_idf=True, stop_words=None,
                                 token_pattern='(?u)\\b\\w\\w*\\b', norm=None)
    vectors = vectorizer.fit_transform([documentA, documentB])
    feature_names = vectorizer.get_feature_names()
    print(pd.DataFrame(vectors.todense().tolist(), columns=feature_names))


def computeTF(wordDict, bagOfWords):
    tfDict = {}
    bagOfWordsCount = len(bagOfWords)
    for word, count in wordDict.items():
        tfDict[word] = count / float(bagOfWordsCount)
    return tfDict


def computeIDF(documents):
    import math
    N = len(documents)
    idfDict = dict.fromkeys(documents[0].keys(), 0)
    for document in documents:
        for word, val in document.items():
            if val > 0:
                idfDict[word] += 1
    for word, val in idfDict.items():
        idfDict[word] = math.log(N / float(val))
    return idfDict


def computeTFIDF(tfBagOfWords, idfs):
    tfidf = {}
    for word, val in tfBagOfWords.items():
        tfidf[word] = val * idfs[word]
    return tfidf


if __name__ == "__main__":
    main()
I can compare the calculation of term frequency; both results look the same. But when I calculate the IDF and then TF-IDF, there are differences between the code from the website and TfidfVectorizer (I also tried the combination of CountVectorizer and TfidfTransformer to make sure it returns the same results as TfidfVectorizer).
Code Tf-Idf results:
TfidfVectorizer Tf-Idf results:
Can anybody help me with code that returns the same results as TfidfVectorizer, or with a setting of TfidfVectorizer that returns the same results as the code above?
Here is my adaptation of your code to reproduce the TfidfVectorizer output for your data.
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from IPython.display import display
documentA = 'the man went out for a walk'
documentB = 'the children sat around the fire'
corpus = [documentA, documentB]
bagOfWordsA = documentA.split(' ')
bagOfWordsB = documentB.split(' ')
uniqueWords = set(bagOfWordsA).union(set(bagOfWordsB))
print('----------- compare word count -------------------')
numOfWordsA = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsA:
    numOfWordsA[word] += 1
numOfWordsB = dict.fromkeys(uniqueWords, 0)
for word in bagOfWordsB:
    numOfWordsB[word] += 1
series_A = pd.Series(numOfWordsA)
series_B = pd.Series(numOfWordsB)
df = pd.concat([series_A, series_B], axis=1).T
df = df.reindex(sorted(df.columns), axis=1)
display(df)
tf_df = df.divide(df.sum(1), axis='index')

# sklearn's smoothed idf: idf(t) = ln((1 + n) / (1 + df(t))) + 1
n_d = 1 + tf_df.shape[0]                  # 1 + number of documents
df_d_t = 1 + (tf_df.values > 0).sum(0)    # 1 + document frequency of each term
idf = np.log(n_d / df_d_t) + 1

pd.DataFrame(df.values * idf,             # raw counts times idf (TfidfVectorizer uses raw counts as tf)
             columns=df.columns)

tfidf = TfidfVectorizer(token_pattern='(?u)\\b\\w\\w*\\b', norm=None)
pd.DataFrame(tfidf.fit_transform(corpus).todense(),
             columns=tfidf.get_feature_names())
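As a quick sanity check (an addition, reusing df, idf, tfidf and corpus from the snippet above), the hand-computed matrix and the TfidfVectorizer output can be compared numerically:
import numpy as np

manual = df.values * idf                              # raw counts times smoothed idf
sklearn_out = tfidf.fit_transform(corpus).todense()   # TfidfVectorizer(norm=None) output

# both use the same alphabetically sorted vocabulary, so a direct comparison works
print(np.allclose(manual, sklearn_out))               # expected: True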
For more details on the implementation, refer to the scikit-learn documentation.
I have the following input data, and I would like to remove stopwords from it and do tokenization:
input = [['Hi i am going to college', 'We will meet next time possible'],
         ['My college name is jntu', 'I am into machine learning specialization'],
         ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]
I have tried the following code but am not getting the desired results:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(input)
filtered_sentence = [w for w in word_tokens if not w in stop_words]
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
#print(word_tokens)
print(filtered_sentence)
Expecting output like below:
Output = [['Hi', 'going', 'college', 'meet','next', 'time', 'possible'],
['college', 'name','jntu', 'machine', 'learning', 'specialization'],
['Machine', 'learnin', 'favorite', 'subject' ,'using', 'python', 'implementation']]
I believe this will help you.
stop_words = set(stopwords.words('english'))
op = []
for item in _input:
    word_tokens = word_tokenize(' '.join(item).lower())
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    op.append(filtered_sentence)
print(op)
Each item in your list has two strings. So, join them as a single string and remove the stopwords.
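If the original casing should be preserved (the expected output keeps 'Hi' and 'Machine' capitalized), one possible variant, sketched here reusing _input, stop_words and word_tokenize from above, is to lowercase only for the stopword check:
op = []
for item in _input:
    word_tokens = word_tokenize(' '.join(item))  # keep the original casing
    # compare case-insensitively against the stopword list, keep tokens as written
    filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
    op.append(filtered_sentence)
print(op)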
Start like before
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
input_ = [['Hi i am going to college', 'We will meet next time possible'],
          ['My college name is jntu', 'I am into machine learning specialization'],
          ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]
I think it is better to name your input input_, since input already has a meaning in Python.
I would start with flattening your input. Instead of a nested list of lists, we should have a single list of sentences:
input_flatten = [sentence for sublist in input_ for sentence in sublist]
print(input_flatten)
>>>['Hi i am going to college',
'We will meet next time possible',
'My college name is jntu',
'I am into machine learning specialization',
'Machine learnin is my favorite subject',
'Here i am using python for implementation']
Then you can go through every sentence and remove the stop words like so:
sentences_without_stopwords = []
for sentence in input_flatten:
    sentence_tokenized = word_tokenize(sentence)
    stop_words_removed = [word for word in sentence_tokenized if word not in stop_words]
    sentences_without_stopwords.append(stop_words_removed)
print(sentences_without_stopwords)
>>>[['Hi', 'going', 'college'],
['We', 'meet', 'next', 'time', 'possible'],
['My', 'college', 'name', 'jntu'],
['I', 'machine', 'learning', 'specialization'],
['Machine', 'learnin', 'favorite', 'subject'],
['Here', 'using', 'python', 'implementation']]
I need to print only the topic word (only one word), but the output contains a number and I cannot get just the topic name, like "Happy". My string word is "Happy"; why does it show "Happi"?
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import string
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
fr = open('Happy DespicableMe.txt','r')
doc_a = fr.read()
fr.close()
doc_set = [doc_a]
texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if i not in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word = dictionary, passes=20)
rafa = ldamodel.show_topics(num_topics=1, num_words=1, log=False , formatted=False)
print(rafa)
It only shows [(0, '0.142*"happi"')]. But I want to print only the word.
You are plagued by a misunderstanding:
Stemming extracts the stem of a word through a series of transformation rules that strip off common suffixes and prefixes. Indeed, the resulting stem is not necessarily an actual English word. The purpose of stemming is to normalize words for comparison. E.g.
stem_word('happy') == stem_word('happier')
What you need is a lemmatizer (e.g. nltk.stem.wordnet) to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
After you have installed the wordnet corpus, you can use it like this:
from nltk.corpus import wordnet
syns = wordnet.synsets("happier")
print(syns[0].lemmas()[0].name())
Output:
happy
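For completeness, nltk also ships a WordNetLemmatizer that wraps the same WordNet lookup; a minimal sketch (the pos='a' hint tells it to treat the word as an adjective):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('happier', pos='a'))  # happy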