How to tokenize python code using the Tokenize module? - python-3.x

Consider that I have a string that contains the python code.
input = "import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words"
How can I tokenize the code? I found the tokenize module (https://docs.python.org/3/library/tokenize.html). However, it is not clear to me how to use the module. It has tokenize.tokenize(readline) but the parameter takes a generator, not a string.

import tokenize
import io
inp = """import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words"""
for token in tokenize.generate_tokens(io.StringIO(inp).readline):
    print(token)
tokenize.tokenize takes a method, not a string. The method should be the readline method of an IO object.
In addition, tokenize.tokenize expects readline to return bytes; you can use tokenize.generate_tokens instead with a readline method that returns strings.
Your input should also be a triple-quoted string, as it spans multiple lines.
See io.TextIOBase and tokenize.generate_tokens for more info.

If you want to stick with tokenize.tokenize(), then this is what you can do:
from tokenize import tokenize
from io import BytesIO
code = """import nltk
from nltk.stem import PorterStemmer
porter_stemmer=PorterStemmer()
words=["connect","connected","connection","connections","connects"]
stemmed_words=[porter_stemmer.stem(word) for word in words]
stemmed_words"""
for tok in tokenize(BytesIO(code.encode('utf-8')).readline):
    print(f"Type: {tok.type}\nString: {tok.string}\nStart: {tok.start}\nEnd: {tok.end}\nLine: {tok.line.strip()}\n======\n")
From the documentation you can see:
The generator produces 5-tuples with these members: the token type; the token string; a 2-tuple (srow, scol) of ints specifying the row and column where the token begins in the source; a 2-tuple (erow, ecol) of ints specifying the row and column where the token ends in the source; and the line on which the token was found. The line passed (the last tuple item) is the physical line. The 5 tuple is returned as a named tuple with the field names: type string start end line.
The returned named tuple has an additional property named exact_type that contains the exact operator type for OP tokens. For all other token types exact_type equals the named tuple type field.
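For example, here is a minimal illustration of exact_type (it uses a shorter input string than the one above, purely for brevity): OP tokens all share type OP, but exact_type tells the individual operators apart.
import token
import tokenize
from io import BytesIO

code = "words = [1, 2]"
for tok in tokenize.tokenize(BytesIO(code.encode('utf-8')).readline):
    if tok.type == token.OP:
        # exact_type distinguishes the individual operators, e.g. EQUAL, LSQB, COMMA, RSQB
        print(tok.string, token.tok_name[tok.exact_type])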

Related

Reading from CSV file without tokenizing words into letters and numbers into digits

I am downloading a CSV file and then reading it using the csv module. For some reason, words and numbers get tokenized into letters and single digits. However, there is an exception with "1 Mo", "3 Mo", etc.
I am getting the CSV file from here:
url = "https://home.treasury.gov/resource-center/data-chart-center/interest-rates/daily-treasury-rates.csv/2022/all?type=daily_treasury_yield_curve&field_tdr_date_value=2022&page&_format=csv"
I use Python 3.10 and the code looks as follows:
from urllib.request import urlopen
import csv
response = urlopen(url)
content = response.read().decode('utf-8')
csv_data = csv.reader(content, delimiter=',')
for row in csv_data:
    print(row)
Here is what I am getting:
['D']
['a']
['t']
['e']
['','']
['1 Mo']
['','']
['2 Mo']
['','']
['3 Mo']
['','']
.
.
.
['30 Yr']
[]
['1']
['1']
['/']
['0']
['8']
['/']
.
.
.
I tried different delimiters but it does not help.
P.S. When I simply save the CSV file to disk and then open it, everything works properly. But I don't want to have this extra step.
Check out the documentation for csv.reader:
csv.reader(csvfile, dialect='excel', **fmtparams)
...csvfile can be any object which supports the iterator protocol and returns a string each time its __next__() method is called -- file objects and list objects are both suitable...
Notice that your variable content is a string, not a file. In Python, a string is iterable, but iterating over it yields one character at a time, not one line at a time. You probably want to convert your long CSV string into a list of lines, so that __next__() (when it is called internally by the reader) gives the next line instead of the next character. This is also why your code mysteriously works when you save the CSV to a file first: iterating over an open file object returns the next line of input each time __next__() is invoked.
To accomplish this, try using the following line in place of line 4 (the content = ... assignment):
content = response.read().decode('utf-8').split("\n")
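Alternatively, you can wrap the decoded text in io.StringIO, which gives csv.reader a file-like object whose iterator yields lines. A minimal sketch, reusing the url variable from the question:
import csv
import io
from urllib.request import urlopen

# url is the same daily-treasury-rates CSV URL from the question
response = urlopen(url)
content = response.read().decode('utf-8')

# StringIO wraps the string in a file-like object, so iteration yields lines, not characters
for row in csv.reader(io.StringIO(content), delimiter=','):
    print(row)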

TypeError: lemmatize() missing 1 required positional argument: 'word' (WordNetLemmatizer)

I am facing a problem with WordNetLemmatizer.
What I am doing is filtering out useless words like int and int + str by using CountVectorizer.
Without further ado, my code is the following:
letters_only is a function that returns False if a word contains an int, or an int + str.
groups = fetch_20newsgroups()
cleaned = []
all_names = set(names.words())
for post in groups:
    cleaned.append(''.join([lemmatizer.lemmatize(word.lower())
                            for word in post.split()
                            if letters_only(word)
                            and word not in all_names]))
Just in case, I tried this as well:
for post in groups.data:
    for word in post.split():
        if letters_only(word) and word not in all_names:
            cleaned.append(''.join(lemmatizer.lemmatize(word.lower())))
Both versions of the code give me the same error, which is "TypeError: lemmatize() missing 1 required positional argument: 'word'".
I'm not fully certain, because you didn't include the piece of code where lemmatizer was initialised, but I think this is the error you're experiencing:
from nltk import WordNetLemmatizer
lemmatizer = WordNetLemmatizer
word = "Arguments"
print(lemmatizer.lemmatize(word.lower()))
Which outputs:
Traceback (most recent call last):
File "c:\...\nltk_70755449.py", line 11, in <module>
print(lemmatizer.lemmatize(word.lower()))
TypeError: lemmatize() missing 1 required positional argument: 'word'
As you can see from the WordNetLemmatizer documentation, you must instantiate the WordNetLemmatizer class and then call the lemmatize method on that instance. Right now, lemmatizer is just a reference to the class WordNetLemmatizer, instead of an instance of it. Luckily, this is an easy fix, as we can create an instance with WordNetLemmatizer():
from nltk import WordNetLemmatizer
lemmatizer = WordNetLemmatizer() # <- Notice the class initialization with ()
word = "Arguments"
print(lemmatizer.lemmatize(word.lower()))
This outputs:
argument
That should solve your problem.

Stemming and Lemmatization on Array

I don't quite understand why I cannot lemmatize or do stemming. I tried converting the array to a string, but I had no luck.
This is my code.
import bs4, re, string, nltk, numpy as np, pandas as pd
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen
news_url="https://news.google.com/news/rss"
Client=urlopen(news_url)
xml_page=Client.read()
Client.close()
soup_page=soup(xml_page,"xml")
news_list=soup_page.findAll("item")
limit=19
corpus = []
# Print news title, url and publish date
for index, news in enumerate(news_list):
    #print(news.title.text)
    #print(index+1)
    corpus.append(news.title.text)
    if index == limit:
        break
#print(arrayList)
df = pd.DataFrame(corpus, columns=['News'])
wpt=nltk.WordPunctTokenizer()
stop_words=nltk.corpus.stopwords.words('english')
def normalize_document(doc):
    # lowercase and remove special characters/whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)  # re.I: ignore case, re.A: ASCII-only matching
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus=np.vectorize(normalize_document)
norm_corpus=normalize_corpus(corpus)
norm_corpus
The error I get starts with the next lines I add
stemmer = PorterStemmer()
sentences = nltk.sent_tokenize(norm_corpus)
# Stemming
for i in range(len(norm_corpus)):
    words = nltk.word_tokenize(norm_corpus[i])
    words = [stemmer.stem(word) for word in words]
    norm_corpus[i] = ' '.join(words)
Once I insert these lines, I get the following error:
TypeError: cannot use a string pattern on a bytes-like object
I think if I solve the stemming error, the same solution will apply to my lemmatization error.
The type of norm_corpus is numpy.ndarray, which gets treated as a bytes-like object rather than a string. The sent_tokenize method expects a string, hence the error. You need to convert norm_corpus to a list of strings to get rid of this error.
What I don't understand is why you would vectorize the document before stemming. Is there a problem with doing it the other way around, i.e. stemming first and then vectorizing? The error should be resolved then.
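Not a drop-in fix, but a minimal sketch of that conversion, assuming norm_corpus was produced by normalize_corpus(corpus) exactly as in the question:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# np.vectorize returns a numpy array of strings; convert it back to a plain list of str
norm_corpus = [str(doc) for doc in norm_corpus]

# stem each already-normalized document word by word
for i in range(len(norm_corpus)):
    words = nltk.word_tokenize(norm_corpus[i])
    words = [stemmer.stem(word) for word in words]
    norm_corpus[i] = ' '.join(words)
With a plain list of strings, word_tokenize (and sent_tokenize, if you need it) gets the str argument it expects.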

How does ruamel.yaml determine the encoding of escaped byte sequences in a string?

I am having trouble figuring out where to modify or configure ruamel.yaml's loader to get it to parse some old YAML with the correct encoding. The essence of the problem is that an escaped byte sequence in the document seems to be interpreted as latin1, and I have no earthly clue where it is doing that, even after some source diving. Here is a code sample that demonstrates the behavior (this in particular was run in Python 3.6):
from ruamel.yaml import YAML
yaml = YAML()
yaml.load('a:\n b: "\\xE2\\x80\\x99"\n') # Note that this is a str (that is, unicode) with escapes for the byte escapes in the YAML document
# ordereddict([('a', ordereddict([('b', 'â\x80\x99')]))])
Here are the same bytes decoded manually, just to show what it should parse to:
>>> b"\xE2\x80\x99".decode('utf8')
'’'
Note that I don't really have any control over the source document, so modifying it to produce the correct output with ruamel.yaml is out of the question.
ruamel.yaml doesn't interpret individual strings, it interprets the stream it gets handed, i.e. the argument to .load(). If that argument is a byte-stream or a file-like object then its encoding is determined based on the BOM, defaulting to UTF-8. But again: that is at the stream level, not at individual scalar content after interpreting escapes. Since you hand .load() Unicode (as this is Python 3), that "stream" needs no further decoding. (Although irrelevant for this question: it is done in the reader.py:Reader methods stream and determine_encoding.)
The hex escapes (of the form \xAB) will just put a specific hex value in the type the loader uses to construct the scalar, that is the value for key 'b', and that is a normal Python 3 str, i.e. Unicode in one of its internal representations. That you get the â in your output is because of how your Python is configured to display its str types.
So you won't "find" the place where ruamel.yaml decodes that
byte-sequence, because that is already assumed to be Unicode.
So the thing to do is to double-decode your double-quoted scalars (you only have to address those, as plain, single-quoted, and literal/folded scalars cannot have the hex escapes). There are various points at which you can try to do that, but I think constructor.py:RoundTripConstructor.construct_scalar and scalarstring.py:DoubleQuotedScalarString are the best candidates. The former of those might take some digging to find, but the latter is actually the type you'll get if you inspect that string after loading when you add the option to preserve quotes:
yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(type(data['a']['b']))
which prints:
<class 'ruamel.yaml.scalarstring.DoubleQuotedScalarString'>
knowing that you can inspect that rather simple wrapper class:
class DoubleQuotedScalarString(ScalarString):
    __slots__ = ()
    style = '"'

    def __new__(cls, value, anchor=None):
        # type: (Text, Any) -> Any
        return ScalarString.__new__(cls, value, anchor=anchor)
"update" the only method there (__new__) to do your double
encoding (you might have to put in additional checks to not double encode all
double quoted scalars0:
import sys
import codecs
import ruamel.yaml

def my_new(cls, value, anchor=None):
    # type information only needed if using mypy
    # value is of type 'str': encode it to bytes "without conversion" (latin-1), then decode as UTF-8
    value = value.encode('latin_1').decode('utf-8')
    return ruamel.yaml.scalarstring.ScalarString.__new__(cls, value, anchor=anchor)

ruamel.yaml.scalarstring.DoubleQuotedScalarString.__new__ = my_new

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
data = yaml.load('a:\n b: "\\xE2\\x80\\x99"\n')
print(data)
which gives:
ordereddict([('a', ordereddict([('b', '’')]))])

Getting a value error: invalid literal for int() with base 10: '56,990'

So I am trying to scrape a website containing the price of a laptop. However, the price is a string, and for comparison purposes I need to convert it to an int. But on doing so I get a ValueError: invalid literal for int() with base 10: '56,990'
Below is the code:
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.flipkart.com/apple-macbook-air-core-i5-5th-gen-8-gb-128-gb-ssd-mac-os-sierra-mqd32hn-a-a1466/p/itmevcpqqhf6azn3?pid=COMEVCPQBXBDFJ8C&srno=s_1_1&otracker=search&lid=LSTCOMEVCPQBXBDFJ8C5XWYJP&fm=SEARCH&iid=2899998f-8606-4b81-a303-46fd62a7882b.COMEVCPQBXBDFJ8C.SEARCH&qH=9e3635d7234e9051")
data = r.text
soup = BeautifulSoup(data,"lxml")
data=soup.find('div',{"class":"_1vC4OE _37U4_g"})
cost=(data.text[1:].strip())
print(int(cost))
P.S. I used text[1:] to remove the currency character.
I get the error in the last line. Basically, I need to get the int value of the cost.
The value has a comma in it, so you need to replace the comma with an empty string before converting it to an integer.
print(int(cost.replace(',','')))
Python does not understand , as a group separator in integers, so you'll need to remove it. In Python 3, try:
cost = data.text[1:].strip().translate(str.maketrans('', '', ','))
Rather than invent a new solution for every character you don't want (strip() function for whitespace, [1:] index for the currency, something else for the digit separator) consider a single solution to gather what you do want:
>>> import re
>>> text = "\u20B956,990\n"
>>> cost = re.sub(r"\D", "", text)
>>> print(int(cost))
56990
The re.sub() replaces anything that isn't a digit with nothing.
