How to implement nltk stopwords on python dataframe - python-3.x

I have an Excel file that contains 1000 lines of text articles. I want to apply NLTK stopwords (as I want to remove certain characters or words from being printed). How can I apply NLTK to a pandas dataframe? For instance, I don't want words like a, nothing, were, the, etc. to be printed.
import pandas as pd
import re
import string
from nltk.corpus import stopwords
# custom stop list (note: stopwords.words() takes a language name such as 'english', not individual words)
stop = ["a", "about", "above", "across", "after", "afterwards",
        "again", "all", "almost", "alone", "along", "already", "also",
        "although", "always", "am", "among", "amongst", "amoungst", "amount", "an",
        "and", "another", "any", "anyhow", "anyone", "anything",
        "anyway", "anywhere", "are", "as", "at", "be", "became",
        "because", "become", "becomes", "becoming", "been", "ie",
        "thereafter", "thereby", "therefore", "therein", "thereupon"]
df = pd.read_excel('C:\\Users\\farid-PC\\Desktop\\Tester.xlsx')
pd.set_option('display.max_colwidth', 1000)  # untruncate the unseen text
df['Text'] = df['Text'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop]))
frequency = df.Text.str.split(expand=True).stack().value_counts()# counter
T = 4000000
word_freq = frequency/T #frequency of the word occurrence in the document
print("word P(w)")
print(word_freq)
Data file (Excel file):
Text
Trump will drop a bomb on North Korea
Building a wall on the U.S.-Mexico border will take literally years
Wisconsin is on pace to double the number of layoffs this year.
Says John McCain has done nothing to help the vets.
Suzanne Bonamici supports a plan that will cut choice for Medicare Advantage seniors.
When asked by a reporter whether hes at the center of a criminal scheme to violate campaign laws, Gov. Scott Walker nodded yes.
Output required:
word word_frequency
Trump 0.00256
bomb 0.0076
Wisconsin 0.00345
(The output shouldn't include stop words, punctuation, or numbers.)

Have you tried something like this?
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))

def filter_stopwords(sentence):
    word_tokens = word_tokenize(sentence)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return filtered_sentence

example_df.apply(filter_stopwords)
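Applied to the dataframe from the question, the usage could look like this (a sketch; it assumes the articles are in a column named Text, as in the question, and stores the result in a hypothetical Filtered column):
# hypothetical usage on the question's dataframe
df['Filtered'] = df['Text'].apply(lambda text: ' '.join(filter_stopwords(text)))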

You can do it like this:
import pandas as pd
from nltk.corpus import stopwords

stop = stopwords.words('english')
stops = r'\b({})\b'.format('|'.join(stop))
df = pd.DataFrame({'A': ['Some text that I wrote',
                         'Some more text for you']})
df['A'] = df['A'].str.replace(stops, '', regex=True).str.replace(r'\s+', ' ', regex=True)
df
# A
#0 Some text I wrote
#1 Some text
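To also get the per-word frequencies the original question asks for (with punctuation and numbers stripped), here is a rough follow-up sketch applied to the question's own dataframe, reusing its Text column, its T = 4000000, and the stops pattern built above:
import string

# lowercase so the stopword pattern matches regardless of case, then remove stopwords
df['Text'] = df['Text'].str.lower()
df['Text'] = df['Text'].str.replace(stops, '', regex=True)
# strip punctuation and digits before counting (assumption: simple character stripping is enough for this data)
df['Text'] = df['Text'].str.translate(str.maketrans('', '', string.punctuation + string.digits))

T = 4000000
word_freq = df['Text'].str.split(expand=True).stack().value_counts() / T
print(word_freq)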

Related

How to add each element (sentence) of a list to a pandas column?

I am extracting information about chemical elements from Wikipedia. The summary contains sentences, and I want each sentence to be added as follows:
Molecule    Sentence1      Sentence1 and sentence2    All_sentence
MgO         this is s1.    this is s1. this is s2.    all_sentence
CaO         this is s1.    this is s1. this is s2.    all_sentence
What I've achieved so far
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
text_sentences = nlp(sumary)
sent_list = []
for sentence in text_sentences.sents:
    sent_list.append(sentence.text)
#print(sent_list)
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})
print(df.head())
The output looks like:
Molecule    Description
MgO         <sentence 1 of the summary>
MgO         <sentence 2 of the summary>
...
The Molecule value is repeated for each sentence row, which is not what I want. Please suggest a solution.
It's not clear why you would want to repeat all sentences in each column but you can get to the form you want with pivot:
import spacy
import pandas as pd
import wikipediaapi
import csv
wiki_wiki = wikipediaapi.Wikipedia('en')
chemical = input("Write the name of molecule: ")
page_py = wiki_wiki.page(chemical)
sumary = page_py.summary[0:]
nlp = spacy.load('en_core_web_sm')
sent_list = [sent.text for sent in nlp(sumary).sents]
#cumul_sent_list = [' '.join(sent_list[:i]) for i in range(1, len(sent_list)+1)] # uncomment to cumulate sentences in columns
df = pd.DataFrame(
    {'Molecule': chemical,
     'Description': sent_list})  # replace sent_list with cumul_sent_list if cumulating
df["Sentences"] = pd.Series([f"Sentence{i + 1}" for i in range(len(df))]) # replace "Sentence{i+1}" with "Sentence1-{i+1}" if cumulating
df = df.pivot(index="Molecule", columns="Sentences", values="Description")
print(df)
sent_list can be created using a list comprehension. Create cumul_sent_list if you want your sentences to be repeated in columns.
Output:
Sentences Sentence1 ... Sentence9
Molecule ...
MgO Magnesium oxide (MgO), or magnesia, is a white... ... According to evolutionary crystal structure pr...

Using NLTK, how to search for concepts in a text

I'm a novice to both Python and NLTK. I'm trying to see how certain concepts are represented in a text using NLTK. I have a CSV file containing the concept words.
I want to see how frequent each concept is, e.g., Freedom, Courage, and all the other concepts. I also want to know how to make sure the code looks for bigrams and trigrams. However, the code I have below only allows me to look for a single list of words in a text (preps.txt).
The output I expect is something like:
Concept = Frequency in text, i.e., Freedom = 10, Courage = 20
import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/Muhsa/Myfolder/Concepts' #this is where the texts I want to study are located
Concepts= PlaintextCorpusReader(corpus_root, '.*')
Concepts.fileids()
for fileid in Concepts.fileids():
    text3 = Concepts.words(fileid)
from nltk import word_tokenize
from nltk import FreqDist
text3 = Concepts.words(fileid)
preps = open('preps.txt', encoding="utf-8")
rawpreps = preps.read() #preps refer to the file that has the list of words
tokens = word_tokenize(rawpreps)
texty = nltk.Text(tokens)
fdist = nltk.FreqDist(w.lower() for w in text3)
for m in texty:
    print(m + ':', fdist[m], end=' ')
I reorganised your code a little bit. I assumed you had one file per concept's words, and that 'preps.txt' only contained the courage words but not the others.
I hope it is easy to understand.
import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import word_tokenize
from nltk import FreqDist
# Load the courage vocabulary
with open('preps.txt', encoding="utf-8") as file:
    content = file.read()  # preps refers to the file that has the list of words
courage_words = content.split('\n')  # This is a list of words
# load freedom and development words in the same fashion

# Load the corpus
corpus_root = '/Users/Muhsa/Myfolder/Concepts'  # this is where the texts I want to study are located
corpus = PlaintextCorpusReader(corpus_root, '.*')

# Count the number of words in the whole corpus that are also in the courage vocabulary
courage_freq = len([w for w in corpus.words() if w in courage_words])
print('Corpus contains {} courage words'.format(courage_freq))

# For each file in the corpus
for file_id in corpus.fileids():
    # Count the number of words in the file that are also in the courage vocabulary
    file_freq = len([w for w in corpus.words(file_id) if w in courage_words])
    print(file_id, file_freq)
Or better
# Load the concept vocabularies from different files into a python dictionary
concept_voc = {}
for file_path in ['courage.txt', 'freedom.txt', 'development.txt']:
    concept_name = file_path.replace('.txt', '')
    with open(file_path) as f:
        voc = f.read().split('\n')
    concept_voc[concept_name] = voc

# Or load the concept vocabularies from a csv file: each column is one vocabulary, the first line is the concept name
import pandas as pd
df = pd.read_csv('to_dict.csv')
concept_voc = df.to_dict('list')
# concept_voc['courage'] returns the list of courage words

# And then for each concept compute the frequency as before
for concept in concept_voc:
    voc = concept_voc[concept]
    corpus_freq = len([w for w in corpus.words() if w in voc])
    print(concept, '=', corpus_freq)
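The question also asks about bigrams and trigrams. One possible approach (a sketch, untested, reusing corpus and concept_voc from above) is to compare the corpus n-grams against multi-word vocabulary entries with nltk.util.ngrams:
from nltk.util import ngrams

# count vocabulary entries of one, two or three words over the whole corpus
words = [w.lower() for w in corpus.words()]
for concept, voc in concept_voc.items():
    voc_grams = {tuple(entry.lower().split()) for entry in voc if entry.strip()}
    freq = sum(1 for n in (1, 2, 3) for gram in ngrams(words, n) if gram in voc_grams)
    print(concept, '=', freq)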

How to replace random elements of a list with a unique symbol?

I am a newbie to Python programming. I have two lists: the first contains stopwords, while the other contains the text document. I want to replace the stop words in the text document with "/". Could anyone help?
I have used the replace function, but it was giving an error:
text = "This is an example showing off word filtration"
stop = `set`(stopwords.words("english"))
text = nltk.word_tokenize(document)
`for` word in stop:
text = text.replace(stop, "/")
`print`(text)
It should output
"/ / / example showing / word filtration"
How about a list comprehension:
>>> from nltk.corpus import stopwords
>>> from nltk.tokenize import word_tokenize
>>> stop_words = set(stopwords.words('english'))
>>> text = "This is an example showing off word filtration"
>>> text_tokens = word_tokenize(text)
>>> replaced_text_words = ["/" if word.lower() in stop_words else word for word in text_tokens]
>>> replaced_text_words
['/', '/', '/', 'example', 'showing', '/', 'word', 'filtration']
>>> replaced_sentence = " ".join(replaced_text_words)
>>> replaced_sentence
'/ / / example showing / word filtration'
How about using a regex pattern?
Your code could then look like this:
from nltk.corpus import stopwords
import nltk
import re

text = "This is an example showing off word filtration"
text = text.lower()
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('/ ', text)
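Printing the result should give roughly the string the asker expected:
print(text)
# / / / example showing / word filtration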
You should use word, not stop, in your replace function:
for word in stop:
    text = text.replace(word, "/")
You can try this:
' '.join([item if item.lower() not in stop else "/" for item in text])

Removing stopwords and tokenization in python

I have the following input data, and I would like to remove stopwords from it and do tokenization:
input = [['Hi i am going to college', 'We will meet next time possible'],
         ['My college name is jntu', 'I am into machine learning specialization'],
         ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]
I have tried the following code but am not getting the desired results:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(input)
filtered_sentence = [w for w in word_tokens if not w in stop_words]
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
#print(word_tokens)
print(filtered_sentence)
Expecting output like below:
Output = [['Hi', 'going', 'college', 'meet','next', 'time', 'possible'],
['college', 'name','jntu', 'machine', 'learning', 'specialization'],
['Machine', 'learnin', 'favorite', 'subject' ,'using', 'python', 'implementation']]
I believe this will help you.
stop_words = set(stopwords.words('english'))
op=[]
for item in _input:
    word_tokens = word_tokenize(' '.join(item).lower())
    filtered_sentence = [w for w in word_tokens if not w in stop_words]
    op.append(filtered_sentence)
print(op)
Each item in your list has two strings. So, join them as a single string and remove the stopwords.
Start like before
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
input_ = [['Hi i am going to college', 'We will meet next time possible'],
          ['My college name is jntu', 'I am into machine learning specialization'],
          ['Machine learnin is my favorite subject', 'Here i am using python for implementation']]
I think it is better to name your input input_ since input already has a meaning in Python.
I would start by flattening your input. Instead of a nested list of lists, we should have a single list of sentences:
input_flatten = [sentence for sublist in input_ for sentence in sublist]
print(input_flatten)
>>>['Hi i am going to college',
'We will meet next time possible',
'My college name is jntu',
'I am into machine learning specialization',
'Machine learnin is my favorite subject',
'Here i am using python for implementation']
Then you can go through every sentence and remove the stop words like so:
sentences_without_stopwords = []
for sentence in input_flatten:
    sentence_tokenized = word_tokenize(sentence)
    stop_words_removed = [word for word in sentence_tokenized if word not in stop_words]
    sentences_without_stopwords.append(stop_words_removed)
print(sentences_without_stopwords)
>>>[['Hi', 'going', 'college'],
['We', 'meet', 'next', 'time', 'possible'],
['My', 'college', 'name', 'jntu'],
['I', 'machine', 'learning', 'specialization'],
['Machine', 'learnin', 'favorite', 'subject'],
['Here', 'using', 'python', 'implementation']]

Print only topic name using LDA with python

I need to print only the topic word (only one word), but the result also contains a number, and I cannot get just the topic name, like "Happy". My string word is "Happy", so why does it show "Happi"?
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim import corpora, models
import gensim
import string
tokenizer = RegexpTokenizer(r'\w+')
en_stop = get_stop_words('en')
p_stemmer = PorterStemmer()
fr = open('Happy DespicableMe.txt','r')
doc_a = fr.read()
fr.close()
doc_set = [doc_a]
texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if not i in en_stop]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=1, id2word = dictionary, passes=20)
rafa = ldamodel.show_topics(num_topics=1, num_words=1, log=False , formatted=False)
print(rafa)
It only shows [(0, '0.142*"happi"')]. But I want to print only the word.
You are plagued by a misunderstanding:
Stemming extracts the stem of a word through a series of transformation rules stripping off common suffixes and prefixes. Indeed, the resulting stem is not necessarily an actual English word. The purpose of stemming is to normalize words for comparison. E.g.
stem_word('happy') == stem_word('happier')
What you need is a lemmatizer (e.g. nltk.stem.wordnet) to look up lemmas. Lemmas differ from stems in that a lemma is a canonical form of the word, while a stem may not be a real word.
After you have installed the wordnet corpus you can use it like this:
from nltk.corpus import wordnet
syns = wordnet.synsets("happier")
print(syns[0].lemmas()[0].name())
Output:
happy
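Tying this back to the LDA code: one possible approach (a sketch, untested, reusing stopped_tokens, texts and ldamodel from the question) is to lemmatize the tokens instead of stemming them, and then pull just the top word out of show_topics. In recent gensim versions, show_topics(formatted=False) returns (topic_id, [(word, probability), ...]) pairs:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# replace the stemming step so the topic words stay real words
lemmatized_tokens = [lemmatizer.lemmatize(t) for t in stopped_tokens]
texts.append(lemmatized_tokens)

# after training, extract only the top word of the first topic
topics = ldamodel.show_topics(num_topics=1, num_words=1, formatted=False)
top_word = topics[0][1][0][0]
print(top_word)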
