Removing stopwords and tokenization in python - python-3.x

I have the following input data, and I would like to remove the stopwords from it and tokenize it:
input = [['Hi i am going to college', 'We will meet next time possible'],
         ['My college name is jntu', 'I am into machine learning specialization'],
         ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]
I have tried the following code but am not getting the desired results:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(input)
filtered_sentence = [w for w in word_tokens if not w in stop_words]
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
#print(word_tokens)
print(filtered_sentence)
Expecting output like below:
Output = [['Hi', 'going', 'college', 'meet','next', 'time', 'possible'],
['college', 'name','jntu', 'machine', 'learning', 'specialization'],
['Machine', 'learnin', 'favorite', 'subject' ,'using', 'python', 'implementation']]

I believe this will help you.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
op = []
for item in _input:  # _input is the nested list from the question
    word_tokens = word_tokenize(' '.join(item).lower())
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    op.append(filtered_sentence)
print(op)
Each item in your list has two strings. So, join them as a single string and remove the stopwords.
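For instance, joining and lowercasing the first pair from the question gives a single string that word_tokenize can handle (illustrative only):
item = ['Hi i am going to college', 'We will meet next time possible']
print(' '.join(item).lower())
# hi i am going to college we will meet next time possible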

Start like before
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

input_ = [['Hi i am going to college', 'We will meet next time possible'],
          ['My college name is jntu', 'I am into machine learning specialization'],
          ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]
I think it is better to name your input input_ since input already has a meaning in Python.
I would start by flattening your input. Instead of a nested list of lists, we should have a single list of sentences:
input_flatten = [sentence for sublist in input_ for sentence in sublist]
print(input_flatten)
>>>['Hi i am going to college',
'We will meet next time possible',
'My college name is jntu',
'I am into machine learning specialization',
'Machine learnin is my favorite subject',
'Here i am using python for implementation']
Then you can go through every sentence and remove the stop words like so:
sentences_without_stopwords = []
for sentence in input_flatten:
    sentence_tokenized = word_tokenize(sentence)
    stop_words_removed = [word for word in sentence_tokenized if word not in stop_words]
    sentences_without_stopwords.append(stop_words_removed)
print(sentences_without_stopwords)
>>>[['Hi', 'going', 'college'],
['We', 'meet', 'next', 'time', 'possible'],
['My', 'college', 'name', 'jntu'],
['I', 'machine', 'learning', 'specialization'],
['Machine', 'learnin', 'favorite', 'subject'],
['Here', 'using', 'python', 'implementation']]
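If you would rather keep one token list per inner pair, as in the expected output from the question, a minimal variation on the above (assuming the input_ and stop_words defined earlier, and filtering case-insensitively so that words such as 'Hi' and 'Machine' keep their capitalization) could look like this:
paired_output = []
for pair in input_:
    tokens = word_tokenize(' '.join(pair))
    paired_output.append([w for w in tokens if w.lower() not in stop_words])
print(paired_output)
# e.g. [['Hi', 'going', 'college', 'meet', 'next', 'time', 'possible'], ...]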

Related

While lemmatizing the corpus and splitting and joining it, I get a "WordListCorpusReader not callable" error

My code where I get the error goes as follows:
import re
corpus = []
for i in range(len(sentences)):
    review = re.sub('[^a-zA-z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    review = [lemmatize.lemmatize(word) for word in review
              if not word in set(stopwords('english'))]  # getting the error in this statement
    review = ' '.join(review)
    corpus.append(review)
I am unable to find out what WordListCorpusReader is and do not know how to use it; I only saw this in tutorials. What is the correct syntax and how do I resolve this error?
Replace stopwords('english') with stopwords.words('english') and it will work.
The issue is that you're calling stopwords, which is a WordListCorpusReader object and is not callable. You can, however, call its methods:
>>> type(stopwords)
<class 'nltk.corpus.reader.wordlist.WordListCorpusReader'>
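For example, calling the reader's .words() method (rather than the reader itself) returns the plain list of stop words; the exact entries depend on your NLTK version:
>>> stopwords.words('english')[:5]
['i', 'me', 'my', 'myself', 'we']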
See the code snippet for your case
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatize = WordNetLemmatizer()
corpus = []
sentences = [
    "This is sentence one",
    "Moving on to the sentence two"
]
for i in range(len(sentences)):
    review = re.sub('[^a-zA-z]', ' ', sentences[i])
    review = review.lower()
    review = review.split()
    review = [lemmatize.lemmatize(word) for word in review if not word in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
print(corpus)
The output will be:
['sentence one', 'moving sentence two']

How to add each element (sentence) of a list to a pandas column?

I am extracting information about chemical elements from Wikipedia. The extracted summary contains sentences, and I want each sentence to be added to a DataFrame as follows:
Molecule  Sentence1    Sentence1 and sentence2   All_sentence
MgO       this is s1.  this is s1. this is s2.   all_sentence
CaO       this is s1.  this is s1. this is s2.   all_sentence
What I've achieved so far
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
text_sentences = nlp(sumary)
sent_list = []
for sentence in text_sentences.sents:
    sent_list.append(sentence.text)
#print(sent_list)
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})
print(df.head())
The output looks like:
Molecule  Description
MgO       All sentences are here
MgO       ...
The Molecule column is repeated for each line of sentence, which is not what I want.
Please suggest a solution.
It's not clear why you would want to repeat all sentences in each column, but you can get to the form you want with pivot:
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
sent_list = [sent.text for sent in nlp(sumary).sents]
#cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list)+1)] # uncomment to cumulate sentences in columns
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})  # replace sent_list with cumul_sent_list if cumulating
df["Sentences"] = pd.Series([f"Sentence{i + 1}" for i in range(len(df))]) # replace "Sentence{i+1}" with "Sentence1-{i+1}" if cumulating
df = df.pivot(index="Molecule", columns="Sentences", values="Description")
print(df)
sent_list can be created using a list comprehension. Create cumul_sent_list if you want your sentences to be repeated in columns.
Output:
Sentences Sentence1 ... Sentence9
Molecule ...
MgO Magnesium oxide (MgO), or magnesia, is a white... ... According to evolutionary crystal structure pr...
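For illustration, this is what the cumulative join produces for three placeholder sentences like those in the question's table (a sketch, not part of the answer's run):
sent_list = ['this is s1.', 'this is s2.', 'this is s3.']
cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list) + 1)]
print(cumul_sent_list)
# ['this is s1.', 'this is s1. this is s2.', 'this is s1. this is s2. this is s3.']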

How to convert output into a list and store it?

Below is the code
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()
for w in Wrd_Freq:
    print(ps.stem(w))
Output
read
peopl
say
work
I need the output as
['read',
'people',
'say',
'work']
Full code without the Porter stemmer:
import nltk
import pandas as pd

lower = []
for item in df_text['job_description']:
    lower.append(item.lower())  # lowercase description
tokens = []
type(tokens)
token_string = [str(i) for i in lower]
string = "".join(token_string)
string = string.replace("-", "")
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"\W+", gaps=True)
tokens = tokenizer.tokenize(string)
tokens
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
tokens = [token for token in tokens if token not in stopwords_list]
tokens
freq6000 = []
Wrd_Freq = nltk.FreqDist(tokens)
Wrd_Freq
df_WrdFreq = pd.DataFrame.from_dict(Wrd_Freq, orient='Index')
df_WrdFreq.columns = ['Word Frequency']
freq6000 = df_WrdFreq[(df_WrdFreq['Word Frequency'] >= 6000)]
freq6000.sort_values(by=['Word Frequency'], ascending=False).head(10)
I need to apply the Porter stemmer separately to check whether there is any change to the count list. I need to perform the same steps after including the Porter stemmer and compare the output.
Use a list comprehension:
L = [ps.stem(w) for w in Wrd_Freq]
EDIT:
If you need the top values by count:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
tokens = [token for token in tokens if token not in stopwords_list]
tokens
freq6000 = []
Wrd_Freq = nltk.FreqDist(tokens)
from collections import Counter
c = Counter(tokens)
top = [x for x, y in c.most_common(10)]
print (top)
['data', 'experience', 'business', 'work', 'science',
'learning', 'analytics', 'team', 'analysis', 'machine']
df_WrdFreq = pd.DataFrame.from_dict(Wrd_Freq, orient='Index')
df_WrdFreq.columns=['Word Frequency']
freq6000= df_WrdFreq[(df_WrdFreq['Word Frequency'] >= 6000)]
df = freq6000.sort_values(by=['Word Frequency'],ascending=False).head(10)
print (df)
Word Frequency
data 124289
experience 59135
business 33528
work 28146
science 26864
learning 26850
analytics 21828
team 20825
analysis 20607
machine 20484
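To check whether the Porter stemmer changes these counts, as the question asks, a minimal sketch that reuses the stop-word-filtered tokens list from above could be:
from nltk.stem import PorterStemmer
import nltk

ps = PorterStemmer()
stemmed_tokens = [ps.stem(token) for token in tokens]
Wrd_Freq_stemmed = nltk.FreqDist(stemmed_tokens)

# compare the two distributions, e.g. their top 10 entries
print(Wrd_Freq.most_common(10))
print(Wrd_Freq_stemmed.most_common(10))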

How to implement nltk stopwords on python dataframe

I have an Excel file that contains 1000 lines of text articles. I want to apply NLTK stopwords (as I want to stop certain characters or words from being printed). How can I apply NLTK to a pandas DataFrame? For instance, I don't want words like a, nothing, were, the, etc. to be printed.
import pandas as pd
import re
import string
from nltk.corpus import stopwords

stop = stopwords.words("a", "about", "above", "across", "after", "afterwards",
                       "again", "all", "almost", "alone", "along", "already", "also",
                       "although", "always", "am", "among", "amongst", "amoungst", "amount", "an",
                       "and", "another", "any", "anyhow", "anyone", "anything",
                       "anyway", "anywhere", "are", "as", "at", "be", "became",
                       "because", "become", "becomes", "becoming", "been", "ie",
                       "thereafter", "thereby", "therefore", "therein", "thereupon")
df = pd.read_excel('C:\\Users\\farid-PC\\Desktop\\Tester.xlsx')
pd.set_option('display.max_colwidth', 1000)  # untruncate the unseen text
df[''] = df['Text'].apply(lambda x: ' '.join([item for item in string.split(x) if item not in stop]))
frequency = df.Text.str.split(expand=True).stack().value_counts()  # counter
T = 4000000
word_freq = frequency / T  # frequency of the word occurrence in the document
print("word P(w)")
print(word_freq)
Data file (Excel file):
Text
Trump will drop a bomb on North Korea
Building a wall on the U.S.-Mexico border will take literally years
Wisconsin is on pace to double the number of layoffs this year.
Says John McCain has done nothing to help the vets.
Suzanne Bonamici supports a plan that will cut choice for Medicare Advantage seniors.
When asked by a reporter whether hes at the center of a criminal scheme to violate campaign laws, Gov. Scott Walker nodded yes.
Output required:
word word_frequency
Trump 0.00256
bomb 0.0076
Wisconsin 0.00345
// the output shouldn't include stop words, punctuation, or numbers
Have you tried something like this?
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def filter_stopwords(sentence):
    word_tokens = word_tokenize(sentence)
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    return filtered_sentence

example_df['Text'] = example_df['Text'].apply(filter_stopwords)
You can do it like this:
from nltk.corpus import stopwords
stop = stopwords.words('english')
stops = r'\b({})\b'.format('|'.join(stop))
df = pd.DataFrame({'A': ['Some text that I wrote',
                         'Some more text for you']})
df['A'] = df['A'].str.replace(stops, '').str.replace('\s+', ' ')
df
# A
#0 Some text I wrote
#1 Some text
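To then get the word / P(w) table the question asks for once the stop words are gone, here is a minimal sketch building on the answers above (it assumes the DataFrame has a 'Text' column and T = 4000000 as in the question):
import pandas as pd
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))
T = 4000000

# split into lowercase words, keep only alphabetic tokens that are not stop words
words = df['Text'].str.lower().str.split(expand=True).stack()
words = words[words.str.isalpha() & ~words.isin(stop)]

word_freq = words.value_counts() / T
print(word_freq)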

Print only topic name using LDA with python

I need to print only the topic word (only one word). But the output also contains a number, and I cannot get just the topic word like "Happy". My string is "Happy"; why does it show "Happi"?
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import string
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
fr = open('Happy DespicableMe.txt','r')
doc_a = fr.read()
fr.close()
doc_set = [doc_a]
texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word = dictionary, passes=20)
rafa = ldamodel.show_topics(num_topics=1, num_words=1, log=False , formatted=False)
print(rafa)
It only shows [(0, '0.142*"happi"')]. But I want to print only the word.
You are plagued by a misunderstanding:
Stemming extracts the stem of a word through a series of transformation rules that strip off common suffixes and prefixes. Indeed, the resulting stem is not necessarily an actual English word. The purpose of stemming is to normalize words for comparison. E.g.
stem_word('happy') == stem_word('happier')
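For example, NLTK's Porter stemmer maps 'happy' to the stem 'happi', which is exactly what shows up in your topic:
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
print(p_stemmer.stem('happy'))  # happi -- a stem, not a real English word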
What you need is a lemmatizer (e.g. nltk.stem.wordnet) to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
After you have installed the wordnet corpus you can use it like this:
from nltk.corpus import wordnet
syns = wordnet.synsets("happier")
print(syns[0].lemmas()[0].name())
Output:
happy
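If you only want the raw top word out of the model itself (it will still be a stem unless you lemmatize the tokens instead of stemming them), one option is show_topic, which returns (word, probability) pairs; a small sketch assuming the ldamodel from the question:
top_word = ldamodel.show_topic(0, topn=1)[0][0]  # take the word from the (word, probability) pair
print(top_word)  # e.g. 'happi'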
