I am trying to remove stopwords from my data, and I have used this statement to load the stopwords:
stop = set(stopwords.words('english'))
This set has the character 'd' as one of the stopwords, so when I apply it in my function it removes 'd' from words. Please see the attached picture for reference and guide me on how to fix this.
I checked your code and noticed that you are applying the rem_stopwords function to the clean_text column, while you should apply it to the Tweet column.
Otherwise, note that NLTK removes d, I, and other characters only when they are independent tokens. A token here is a word after you split on spaces, so if you have i'd, it will remove neither d nor I, since they are combined into one word. However, if you have 'I like Football', it will remove I, since it is an independent token.
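To see that token behavior in isolation, here is a tiny standalone sketch (it assumes the stopword list has already been downloaded):
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))

# "i'd" stays one token after split(), so it is only dropped if "i'd" itself is in the stopword list
print([w for w in "i'd like Football".split() if w.lower() not in stop])

# "I" is an independent token here, so it is removed
print([w for w in "I like Football".split() if w.lower() not in stop])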
You can try this code; it should solve your problem:
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop = set(stopwords.words('english'))
df['clean_text'] = df['Tweet'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in (stop)]))
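For a quick sanity check, here is the same line applied to a made-up DataFrame (the demo rows are only an illustration; the column names match the snippet above):
demo = pd.DataFrame({'Tweet': ["This is a test of the stopword filter", "I'd still like to keep contractions intact"]})
demo['clean_text'] = demo['Tweet'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop]))
print(demo[['Tweet', 'clean_text']])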
As part of a large project, I need a function that will check for any misspelt words in a sentence; the sentence can be one word, 30 words, or really any size.
It needs to be fast. If possible I would like to use TextBlob or pyspellchecker, as python_language_tool has problems installing on my computer.
My code so far (non-working):
def spell2():
    from textblob import TextBlob
    count = 0
    sentence = "Tish soulhd al be corrrectt"
    split_sen = sentence.split(" ")
    for thing in split_sen:
        thing = Word(thing)
        thing.spellcheck()
        # if thing is not spelt correctly add to count, if it is go to
        # next word

spell2()
This gives me this error:
thing = Word(thing)
NameError: name 'Word' is not defined
Any suggestions appreciated:)
def spell3():
    from spellchecker import SpellChecker
    s = "Tish soulhd al be corrrectt, riiiigghtttt?"
    wordlist = s.split()
    spell = SpellChecker()
    amount_miss = len(list(spell.unknown(wordlist)))
    print("Possible amount of misspelled words in the text:", amount_miss)

spell3()
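If you would rather stay with TextBlob, the NameError in the question comes from Word never being imported. A minimal sketch of that route, counting words whose best spelling suggestion differs from the original (the helper name spell2_fixed is just for illustration):
from textblob import Word

def spell2_fixed(sentence):
    count = 0
    for token in sentence.split():
        # spellcheck() returns (suggestion, confidence) pairs, best first
        suggestion, confidence = Word(token).spellcheck()[0]
        if suggestion.lower() != token.lower():
            count += 1
    return count

print(spell2_fixed("Tish soulhd al be corrrectt"))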
How do I lemmatize the words in a nested list in a single line? I tried a few things and I am getting close, but I think I may be getting the syntax wrong. How do I fix it?
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word_list = [['test','exams','projects'],['math','exam','things']]
word_list # type list
Try #1: does the lemmatization but in a different format
for word in word_list:
    for e in word:
        print(lemmatizer.lemmatize(e))  # not the result I need
Try #2: looking for a similar one-line approach to solve the problem. Not giving correct results.
[[word for word in lemmatizer.lemmatize(str(doc))] for doc in word_list]
Output needed:
[['test','exam','project'],['math','exam','thing']]
I found a for-loop solution to my question above. I couldn't get it into a single line, but it is working for now. If anyone is looking for a solution:
word_list_lemma = []
for ls in word_list:
    word_lem = []
    for word in ls:
        word_lem.append(lemmatizer.lemmatize(word))
    word_list_lemma.append(word_lem)
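For the single-line version that was originally asked for, a nested list comprehension gives the same result (using the lemmatizer defined above):
word_list_lemma = [[lemmatizer.lemmatize(word) for word in ls] for ls in word_list]
print(word_list_lemma)  # [['test', 'exam', 'project'], ['math', 'exam', 'thing']]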
I am trying to remove stop words from my data set.
stopwordsw = nltk.corpus.stopwords.words('german')
def remove_stopwords(txt_clean):
    txt_clean = [Word for Word in txt_clean if Word not in stopwords]
    return txt_clean
data['Tweet_sw'] = data['Tweet_clean'].apply(lambda x: remove_stopwords(x))
data.head()
I have two problems with that.
First, the output is given character by character (separated by commas), although I run the check against the list of stopwords, which contains whole words.
I can solve this with a join command, but I don't understand why the text is split into characters in the first place.
The second and real problem is that the removal of stop words does not work: words that are clearly in the list are not removed from the sentences.
Where is my mistake in this?
txt_clean = [Word for Word in txt_clean.split() if Word not in stopwords]
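The split() above fixes the first problem: iterating over a string yields single characters, so the text has to be split into words before filtering. A fuller sketch of the whole step (column names are taken from the question; note that the question assigns the list to stopwordsw but filters against stopwords, which may be part of why nothing was removed; the lower-casing here is an extra assumption to catch sentence-initial words):
import nltk

nltk.download('stopwords')
german_stopwords = set(nltk.corpus.stopwords.words('german'))

def remove_stopwords(txt_clean):
    # split the string into word tokens, then drop the ones that are stopwords
    return [word for word in txt_clean.split() if word.lower() not in german_stopwords]

data['Tweet_sw'] = data['Tweet_clean'].apply(remove_stopwords)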
I don't quite understand why I cannot lemmatize or do stemming. I tried converting the array to a string, but had no luck.
This is my code.
import bs4, re, string, nltk, numpy as np, pandas as pd
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
news_url="https://news.google.com/news/rss"
Client=urlopen(news_url)
xml_page=Client.read()
Client.close()
soup_pg=soup(xml_page,"xml")
news_list = soup_pg.findAll("item")
limit=19
corpus = []
# Print news title, url and publish date
for index, news in enumerate(news_list):
    #print(news.title.text)
    #print(index+1)
    corpus.append(news.title.text)
    if index == limit:
        break
#print(arrayList)
df = pd.DataFrame(corpus, columns=['News'])
wpt=nltk.WordPunctTokenizer()
stop_words=nltk.corpus.stopwords.words('english')
def normalize_document(doc):
    # lowercase and remove special characters/whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)  # re.I: ignore case, re.A: ASCII-only matching
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus=np.vectorize(normalize_document)
norm_corpus=normalize_corpus(corpus)
norm_corpus
The error appears once I add the next lines:
stemmer = PorterStemmer()
sentences = nltk.sent_tokenize(norm_corpus)
# Stemming
for i in range(len(norm_corpus)):
    words = nltk.word_tokenize(norm_corpus[i])
    words = [stemmer.stem(word) for word in words]
    norm_corpus[i] = ' '.join(words)
Once I insert these lines, I get the following error:
TypeError: cannot use a string pattern on a bytes-like object
I think that if I solve the stemming error, the same fix will apply to my lemmatization error.
The type of norm_corpus is numpy.ndarray, i.e. it holds bytes-like objects rather than plain Python strings. The sent_tokenize method expects a string, hence the error. You need to convert norm_corpus to a list of strings to get rid of it.
What I don't understand is why you would vectorize the documents before stemming. Is there a problem with doing it the other way around, i.e. first stemming and then vectorizing? The error should be resolved then.
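A minimal sketch of that order, stemming the plain-string corpus first and normalizing afterwards (variable names are taken from the question; nltk.download('punkt') may be needed for word_tokenize):
stemmer = PorterStemmer()

# stem each title while it is still an ordinary Python string
stemmed_corpus = []
for doc in corpus:
    words = nltk.word_tokenize(doc)
    stemmed_corpus.append(' '.join(stemmer.stem(word) for word in words))

# normalize (lowercase, strip punctuation, drop stopwords) after stemming
norm_corpus = normalize_corpus(stemmed_corpus)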
I transferred all my Python 3 code from macOS to Ubuntu 18.04, and in one program I need to use pandas.read_clipboard(). At this point the clipboard holds a list with multiple lines, columns separated by tabs, and each element in quotation marks.
After just trying
import pandas as pd
df = pd.read_clipboard()
I'm getting this error: pandas.errors.ParserError: Expected 8 fields in line 3, saw 11. Error could possibly be due to quotes being ignored when a multi-char delimiter is used. Line 3 looks like "word1" "word2 and another" "word3" .... Without the quotation marks you count 11 elements, and with quotation marks you count 8.
In the next step I tried
import pandas as pd
df = pd.read_clipboard(sep='\t')
and I'm getting no errors, but the result is only a Series with each line of the clipboard source in one element.
Maybe a solution would be to write code that separates the elements of each line after this step, but because it works very well under macOS (with just pd.read_clipboard()) I hope there is a better solution.
Thank you for helping.
I wrote a "turnaround" for my question. It's not the exact solution but because I just need the elements of one column in an array I solved it like that:
import pyperclip
# read clipboard
cb = pyperclip.paste()
# lines in array
cb_arr = cb.splitlines()
column = []
for cb_line in cb_arr:
    # words in array
    cb_words = cb_line.split("\"")
    # pick element of column 1
    word = cb_words[1]
    column.append(word)
# delete column name
column.pop(0)
print(column)
Maybe it helps someone else, too.
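As an alternative sketch (untested here): read_clipboard passes its keyword arguments on to read_csv, so telling it explicitly about both the tab delimiter and the quoting might let it parse the original table directly:
import pandas as pd

# assumes the clipboard really holds tab-separated fields wrapped in double quotes
df = pd.read_clipboard(sep='\t', quotechar='"')
print(df.head())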