Stemming and Lemmatization on Array - python-3.x

I don't quite understand why I cannot lemmatize or stem. I tried converting the array to a string, but I had no luck.
This is my code.
import bs4, re, string, nltk, numpy as np, pandas as pd
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
news_url="https://news.google.com/news/rss"
Client=urlopen(news_url)
xml_page=Client.read()
Client.close()
soup_page=soup(xml_page,"xml")
news_list=soup_page.findAll("item")
limit=19
corpus = []
# Print news title, url and publish date
for index, news in enumerate(news_list):
    #print(news.title.text)
    #print(index+1)
    corpus.append(news.title.text)
    if index == limit:
        break
#print(arrayList)
df = pd.DataFrame(corpus, columns=['News'])
wpt=nltk.WordPunctTokenizer()
stop_words=nltk.corpus.stopwords.words('english')
def normalize_document(doc):
    # lowercase and remove special characters/whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)  # re.I: ignore case, re.A: ASCII-only matching
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus=np.vectorize(normalize_document)
norm_corpus=normalize_corpus(corpus)
norm_corpus
The error starts when I add the following lines:
stemmer = PorterStemmer()
sentences = nltk.sent_tokenize(norm_corpus)
# Stemming
for i in range(len(norm_corpus)):
    words = nltk.word_tokenize(norm_corpus[i])
    words = [stemmer.stem(word) for word in words]
    norm_corpus[i] = ' '.join(words)
Once I insert these lines, I get the following error:
TypeError: cannot use a string pattern on a bytes-like object
I think that if I solve the stemming error, the same solution will apply to my lemmatization error.

The type of norm_corpus is numpy.ndarray, not a plain string. The sent_tokenize method expects a string, hence the error. You need to convert norm_corpus to a list of strings to get rid of this error.
What I don't understand is why you would vectorize the documents before stemming. Is there a problem with doing it the other way around, i.e. stemming first and then vectorizing? That would also resolve the error.
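A minimal sketch of that conversion, assuming the rest of the question's code has already run; each document is tokenized and stemmed individually instead of calling sent_tokenize on the whole array:
norm_corpus = normalize_corpus(corpus).tolist()   # plain Python list of str, not an ndarray

stemmer = PorterStemmer()
for i in range(len(norm_corpus)):
    words = nltk.word_tokenize(norm_corpus[i])
    norm_corpus[i] = ' '.join(stemmer.stem(word) for word in words)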

Related

error while removing the stop-words from the text

I am trying to remove stopwords from my data and I have used this statement to download the stopwords.
stop = set(stopwords.words('english'))
This has the character 'd' as one of the stopwords, so when I apply it in my function it removes 'd' from words. Please see the attached picture for reference and guide me on how to fix this.
I checked the code and noticed that you are applying the rem_stopwords function to the clean_text column, while you should apply it to the Tweet column.
NLTK only removes d, I, and other such characters when they are independent tokens. A token here is a word after you split on spaces, so if you have i'd, it will not remove d or I, since they are combined into one word. However, if you have 'I like Football', it will remove I, since it is an independent token.
You can try this code; it should solve your problem:
import pandas as pd
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')
stop = set(stopwords.words('english'))
df['clean_text'] = df['Tweet'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in (stop)]))
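To see the behaviour concretely, here is a small standalone check; the exact contents of the stopword list can vary slightly between NLTK versions:
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))

# "I'd" survives because the combined token "i'd" is not in the stopword set,
# while a standalone "I" (or "d") would be filtered out
print([w for w in "I'd like Football".split() if w.lower() not in stop])   # ["I'd", 'like', 'Football']
print([w for w in "I like Football".split() if w.lower() not in stop])     # ['like', 'Football']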

lemmatize words in nest list

How do I lemmatize the words in the nested list in a single line? I tried a few things and I am getting close, but I think I may be getting the syntax wrong. How do I fix it?
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
word_list = [['test','exams','projects'],['math','exam','things']]
word_list # type list
Try #1: does the lemmatization, but not in the format I need
for word in word_list:
    for e in word:
        print(lemmatizer.lemmatize(e)) # not the result I need
Try #2: Looking for similar approach in one line to solve the problem. Not giving correct results.
[[word for word in lemmatizer.lemmatize(str(doc))] for doc in word_list]
Output needed:
[['test','exam','project'],['math','exam','thing']]
I found a for-loop solution for my question above. I couldn't get this into a single line, but it is working for now. If anyone is looking for a solution:
word_list_lemma = []
for ls in word_list:
    word_lem = []
    for word in ls:
        word_lem.append(lemmatizer.lemmatize(word))
    word_list_lemma.append(word_lem)
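For reference, the same result can be written as a one-line nested list comprehension:
word_list_lemma = [[lemmatizer.lemmatize(word) for word in ls] for ls in word_list]
# [['test', 'exam', 'project'], ['math', 'exam', 'thing']]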

How to tokenize python code using the Tokenize module?

Consider that I have a string that contains the python code.
input = "import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words"
How can I tokenize the code? I found the tokenize module (https://docs.python.org/3/library/tokenize.html). However, it is not clear to me how to use the module. It has tokenize.tokenize(readline) but the parameter takes a generator, not a string.
import tokenize
import io
inp = """import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words"""
for token in tokenize.generate_tokens(io.StringIO(inp).readline):
    print(token)
tokenize.tokenize takes a method, not a string. The method should be the readline method of an IO object.
In addition, tokenize.tokenize expects the readline method to return bytes; you can use tokenize.generate_tokens instead to work with a readline method that returns strings.
Your input should also be a triple-quoted string, as it spans multiple lines.
See io.TextIOBase and tokenize.generate_tokens for more info.
If you want to stick with tokenize.tokenize(), then this is what you can do:
from tokenize import tokenize
from io import BytesIO
code = """import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words"""
for tok in tokenize(BytesIO(code.encode('utf-8')).readline):
    print(f"Type: {tok.type}\nString: {tok.string}\nStart: {tok.start}\nEnd: {tok.end}\nLine: {tok.line.strip()}\n======\n")
From the documentation you can see:
The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple (srow, scol) of ints specifying the row and column where the token begins in the source; a 2-tuple (erow, ecol) of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the physical line. The 5 tuple is returned as a named tuple with the field names: type string start end line.
The returned named tuple has an additional property named exact_type that contains the exact operator type for OP tokens. For all other token types exact_type equals the named tuple type field.
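Since every token tuple carries its type, you can also filter the stream instead of printing everything. A small, hypothetical example that keeps only NAME tokens; note that keywords such as for and in are also reported as NAME by the tokenize module:
import io
import tokenize

code = 'stemmed_words=[porter_stemmer.stem(word) for word in words]'
names = [tok.string
         for tok in tokenize.generate_tokens(io.StringIO(code).readline)
         if tok.type == tokenize.NAME]
print(names)
# ['stemmed_words', 'porter_stemmer', 'stem', 'word', 'for', 'word', 'in', 'words']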

len() function not giving correct character number

I am trying to figure out the number of characters in a string but for some strange reason len() is only giving me back 1.
Here is an example of my output:
WearWorks is a haptics design company that develops products and
experiences that communicate information through touch. Our first product,
Wayband, is a wearable tactile navigation device for the blind and visually
impaired.
True
1
Here is my code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url="https://www.wear.works/"
response=requests.get(url)
html=response.content
soup=BeautifulSoup(html,'html.parser')
#reference https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python
# getting rid of the script and style tags in the html
for script in soup(["script", "style"]):
    script.extract() # rip it out
    # print(script)
# get text
# grabbing the first chunk of text
text = soup.get_text()[0]
print(isinstance(text, str))
print(len(text))
print(text)
The problem is text = soup.get_text()[0]: the [0] indexes the string and returns only its first character, so len(text) is 1. Change it to text = soup.get_text().
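A quick illustration of why the original code printed 1:
s = "WearWorks is a haptics design company"
print(len(s[0]))   # 1  -- s[0] is only the first character, 'W'
print(len(s))      # 37 -- the length of the whole string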

Finding a regex patterned text inside a python variable

# Ex1
# Number of datasets currently listed on data.gov
# http://catalog.data.gov/dataset
import requests
import re
from bs4 import BeautifulSoup
page = requests.get(
"http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
results = re.search([0-9][0-9][0-9],[0-9][0-9][0-9], value
print(value)
The code is above. I want to find text matching the regex [0-9][0-9][0-9],[0-9][0-9][0-9]
inside the text stored in the variable 'value'.
How can I do this?
Based on ShellayLee's suggestion I changed it to:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get(
"http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
my_match = re.search(r'\d\d\d,\d\d\d', value)
print(my_match)
I am still getting an error:
Traceback (most recent call last):
File "ex1.py", line 19, in
my_match = re.search(r'\d\d\d,\d\d\d', value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 182, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
You need some basics of regex in Python. A regex in Python is represented as a string, and the re module provides functions like match, search, and findall that take a string argument and treat it as a pattern.
In your case, the pattern [0-9][0-9][0-9],[0-9][0-9][0-9] can be represented as:
my_pattern = r'\d\d\d,\d\d\d'
then used like
my_match = re.search(my_pattern, value_text)
where \d matches a digit (the same as [0-9]). The r prefix on the string means the backslashes in it are not treated as escape characters.
The search function returns a match object.
I suggest you walk through some tutorials first to clear up any further confusion. The official HOWTO is well written:
https://docs.python.org/3.6/howto/regex.html
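For completeness, the remaining problem in the updated code is that find_all returns a ResultSet of Tag objects rather than a string, which is exactly what the TypeError complains about. A sketch of a working version, assuming the page still exposes a new-results element:
import re
import requests
from bs4 import BeautifulSoup

page = requests.get("http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')

value = soup.find_all(class_='new-results')   # ResultSet of Tag objects, not a string
if value:
    value_text = value[0].get_text()          # extract the text of the first match
    my_match = re.search(r'\d\d\d,\d\d\d', value_text)
    if my_match:
        print(my_match.group())               # e.g. a count such as 123,456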
