Parsing HTML tags with Python - python-3.x

I have been given a URL and I want to extract the contents of the <BODY> tag from it.
I'm using Python 3. I came across sgmllib, but it is not available for Python 3.
Can someone please guide me with this? Can I use HTMLParser for it?
Here is what I tried:
import urllib.request
f = urllib.request.urlopen("URL")
s = f.read()

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_data(self, data):
        print("Encountered some data:", data)

parser = MyHTMLParser()
parser.feed(s)
This gives me the error: TypeError: Can't convert 'bytes' object to str implicitly

To fix the TypeError, change the s = f.read() line to
s = str(f.read())
The web page you're getting back is returned as bytes, and you need to turn those bytes into a string before feeding them to the parser.

If you take a look at your s variable, its type is bytes.
>>> type(s)
<class 'bytes'>
If you look at HTMLParser.feed, it requires a str as its argument. So, decode it:
>>> x = s.decode('utf-8')
>>> type(x)
<class 'str'>
>>> parser.feed(x)
or do x = str(s) (note that this keeps the b'...' repr around the text, so decode() is the cleaner option).
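Putting the pieces together, here is a minimal sketch that decodes the page and collects only the text that appears inside <body>; the "URL" placeholder and the UTF-8 encoding are assumptions, so adjust them to your page:
import urllib.request
from html.parser import HTMLParser

class BodyParser(HTMLParser):
    """Collects the text that appears between <body> and </body>."""
    def __init__(self):
        super().__init__()
        self.in_body = False
        self.body_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":   # HTMLParser lowercases tag names, so <BODY> matches too
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.body_parts.append(data)

with urllib.request.urlopen("URL") as f:   # "URL" is a placeholder, as in the question
    html = f.read().decode("utf-8")        # assumes the page is UTF-8 encoded

parser = BodyParser()
parser.feed(html)
print("".join(parser.body_parts))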

Related

How to scrape this PDF file?

I want to scrape the tables of this Persian PDF file and get the results as a pandas DataFrame, but I get the error "NameError: name 'PDFResourceManager' is not defined" and no useful content is extracted.
Please help me find a working solution that handles the encoding correctly. Including your tested code is appreciated.
from pdfminer.converter import TextConverter
from io import StringIO
from io import open
from urllib.request import urlopen
import pdfminer as pm

urlpdf = "https://www.codal.ir/Reports/DownloadFile.aspx?id=jck8NF9OtmFW6fpyefK09w%3d%3d"
response = requests.get(urlpdf, verify=False, timeout=5)
f = io.BytesIO(response.content)

def readPDF(f):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    process_pdf(rsrcmgr, device, pdfFile)
    device.close()
    content = retstr.getvalue()
    retstr.close()
    return content

pdfFile = urlopen(urlpdf)
outputString = readPDF(pdfFile)
proceedings = outputString.encode('utf-8')  # creates a UTF-8 byte object
proceedings = str(proceedings)  # creates string representation <- the source of your issue
file = open("extract.txt", "w", encoding="utf-8")  # encodes str to platform specific encoding
file.write(proceedings)
file.close()
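The NameError itself comes from the fact that PDFResourceManager, LAParams and process_pdf are never imported; import pdfminer as pm does not put those names in scope, and in current pdfminer.six the old process_pdf helper has been replaced by PDFPageInterpreter. A minimal sketch of the usual imports and page loop, untested against this particular file:
from io import BytesIO, StringIO

import requests
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def read_pdf(fp):
    """Return the plain text of a PDF opened as a binary file-like object."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    device = TextConverter(rsrcmgr, retstr, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    for page in PDFPage.get_pages(fp):
        interpreter.process_page(page)
    device.close()
    text = retstr.getvalue()
    retstr.close()
    return text

urlpdf = "https://www.codal.ir/Reports/DownloadFile.aspx?id=jck8NF9OtmFW6fpyefK09w%3d%3d"
response = requests.get(urlpdf, verify=False, timeout=5)
print(read_pdf(BytesIO(response.content)))
Whether the Persian text comes out readable depends on how the PDF embeds its fonts, and turning the extracted text into tables still requires further parsing before it can become a DataFrame.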

AttributeError when not trying to use any attribute

Good morning!
I'm using some code (with Python 3.8) that runs both on a local PC and on an ssh server. At one point, I load data from a pickle using the following piece of code:
from os.path import exists
import _pickle as pickle

def load_pickle(pickle_file):
    if exists(pickle_file):
        with open(pickle_file, 'rb') as f:
            loaded_dic = pickle.load(f)
        return loaded_dic
    else:
        return 'Pickle not found'
pickle_file is a string with the path of the pickle. If the pickle exists, the function returns a dictionary; if it doesn't, it returns the string 'Pickle not found'.
On my local PC the code works perfectly and loads the dict without problems. On the ssh server, however, the dict is theoretically loaded, but if I try to access it, just by typing loaded_dic, it throws the following error:
AttributeError: 'NoneType' object has no attribute 'axes'
Because of this, the rest of my code fails when it tries to use the variable loaded_dic.
Thank you very much in advance!
I have a similar problem. For me it happens when I store pandas DataFrames in a dictionary and save that dict as a pickle with pandas version '1.1.1'.
When I read the pickled dictionary with pandas version '0.25.3' on another server, I get the same error.
Both have pickle version 4.0, and I do not have a solution yet other than upgrading to matching pandas versions (a version-agnostic alternative is sketched after the example below).
I made a small example; it also happens when I store just a DataFrame. Saving it on one machine:
import pandas as pd
print("Pandas version", pd.__version__)
df = pd.DataFrame([1, 2, 3])
df.to_pickle('df.pkl')
Pandas version 1.1.1
Then loading it on another machine:
import pandas as pd
print("Pandas version", pd.__version__)
df = pd.read_pickle('df.pkl')
print(type(df))
Pandas version 0.25.3
<class 'pandas.core.frame.DataFrame'>
print(len(df))
results in this error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-8-9f6b6d8c3cd3> in <module>
----> 1 print(len(df))
/opt/conda/lib/python3.7/site-packages/pandas/core/frame.py in __len__(self)
994 Returns length of info axis, but here we use the index.
995 """
--> 996 return len(self.index)
997
998 def dot(self, other):
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __getattr__(self, name)
5173 or name in self._accessors
5174 ):
-> 5175 return object.__getattribute__(self, name)
5176 else:
5177 if self._info_axis._can_hold_identifiers_and_holds_name(name):
pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__get__()
AttributeError: 'NoneType' object has no attribute 'axes'
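If upgrading pandas on both machines is not an option, one possible workaround (my suggestion, not part of the original answers) is to exchange the data in a format that does not depend on pandas internals, such as CSV:
import pandas as pd

df = pd.DataFrame([1, 2, 3])

# Writer side: a plain-text export instead of a pandas pickle.
df.to_csv('df.csv', index=False)

# Reader side: works regardless of which pandas version wrote the file.
df_loaded = pd.read_csv('df.csv')
print(len(df_loaded))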
Avi's answer helped me. I had pickled with a later version of Pandas and was trying to read the pickle file with an earlier version.
This code clearly doesn't work anywhere; it doesn't return loaded_dic, a local variable, so nothing can use it. Change it to:
return pickle.load(f)
and the caller will receive the loaded dict instead of the default return value, None.
Update for the edited question: with the return, your code works as written. The pickle file on that machine must contain the result of pickling None rather than whatever you expected, or your code is broken in some other place we haven't seen. The loading code is fine and behaves exactly as it's supposed to.
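Building on that diagnosis, a small variation of the loader that fails fast when the file unpickles to None (the is-None guard is an addition for illustration, not part of the original question):
from os.path import exists
import pickle

def load_pickle(pickle_file):
    if not exists(pickle_file):
        return 'Pickle not found'
    with open(pickle_file, 'rb') as f:
        loaded_dic = pickle.load(f)
    # Guard added for illustration: a pickled None is much easier to diagnose
    # here than later, when some other piece of code touches loaded_dic.
    if loaded_dic is None:
        raise ValueError(f"{pickle_file} unpickled to None instead of a dict")
    return loaded_dic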

a bytes-like object is required, not '_io.BufferedReader'

I'm trying to load the dumped files using the following code:
cols = None
with open('./experiments/columns.p', 'rb') as p:
    cols = pkl.loads(p).read()
but I get this error instead:
"a bytes-like object is required, not '_io.BufferedReader'"
You're using pickle, so you should use the pickle.load function:
import pickle
with open('./experiments/columns.p', 'rb') as p:
    cols = pickle.load(p)
pickle.load reads directly from the open file object, so unlike pickle.loads(p.read()) it doesn't have to hold the whole raw byte string in memory first, which also makes a MemoryError less likely for large files.
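If you do want to use pickle.loads, you would have to read the file's bytes yourself first, for example:
import pickle

with open('./experiments/columns.p', 'rb') as p:
    cols = pickle.loads(p.read())   # loads() expects bytes, so read the file first
This keeps the entire byte string in memory, which is why pickle.load(p) is usually the better choice.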

Spacy - Convert Token type into list

I have a few elements that I got after performing an operation in spaCy, and each of them has this type.
Input -
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output:
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
<class 'spacy.tokens.token.Token'>
I want to make all elements in the list str type for iteration.
Expected output -
li = ['India', 'Australia', 'Brazil']
for i in li:
    print(type(i))
Output
<class 'str'>
<class 'str'>
<class 'str'>
Please suggest an optimized way to do this.
A spaCy Token has an attribute called text.
Here's a complete example:
import spacy
nlp = spacy.load('en_core_web_sm')
t = u"India Australia Brazil"
li = nlp(t)
for i in li:
    print(i.text)
or, if you want the list of tokens as a list of strings:
list_of_strings = [i.text for i in li]
Thanks for the solution and for sharing your knowledge. It works very well for converting a spaCy doc/span into a string or a list of strings for further string operations.
You can also use this:
for i in li:
    print(str(i))
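Equivalently, as a list comprehension (assuming li is the spaCy Doc from the example above):
list_of_strings = [str(i) for i in li]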

Python - Storing float values in CSV file

I am trying to store the positive and negative scores of the statements in a text file. I want to store the scores in a CSV file. I have implemented the code given below:
import openpyxl
from nltk.tokenize import sent_tokenize
import csv
from senti_classifier import senti_classifier
from nltk.corpus import wordnet
file_content = open('amazon_kindle.txt')
for lines in file_content:
    sentence = sent_tokenize(lines)
    pos_score, neg_score = senti_classifier.polarity_scores(sentence)

with open('target.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n', delimiter=',')

for val in range(pos_score):
    writer.writerow(float(s) for s in val[0])
f.close()
But the code gives me the following error in the for loop:
Traceback (most recent call last):
  File "C:\Users\pc\AppData\Local\Programs\Python\Python36-32\classifier.py", line 21, in <module>
    for val in pos_score:
TypeError: 'float' object is not iterable
You have several problems with your code:
Your code and the traceback do not correspond to each other:
for val in pos_score:         # traceback
for val in range(pos_score):  # code
pos_score is a float, so both versions fail: range() takes an int and for val in ... needs an iterable. Where do you expect to get your list of values from?
From the usage it also looks like you are expecting a list of lists of values, because you use a generator expression in your writerow:
writer.writerow(float(s) for s in val[0])
Perhaps you are only expecting a list of values, in which case you can get rid of the for loop and just use:
writer.writerow(float(val) for val in <list_of_values>)
Using:
with open('target.csv', 'w') as f:
means you no longer need to call f.close(); the with statement closes the file at the end of the block. It also means the writerow() call needs to be inside the with block:
with open('target.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n', delimiter=',')
    writer.writerow(float(val) for val in <list_of_values>)
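Putting the answer's points together, here is a sketch of how the original loop could be restructured, assuming (as the question's own code does) that senti_classifier.polarity_scores returns a (pos, neg) pair for a list of sentences:
import csv
from nltk.tokenize import sent_tokenize
from senti_classifier import senti_classifier

# Write one (pos_score, neg_score) row per line of the input file.
with open('amazon_kindle.txt') as file_content, \
     open('target.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    for line in file_content:
        sentences = sent_tokenize(line)
        pos_score, neg_score = senti_classifier.polarity_scores(sentences)
        writer.writerow([float(pos_score), float(neg_score)])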
