Python Program ValueError: invalid literal for int() with base 10: - python-3.x

I am a python-beginner, trying to write a program to: "Corona Virus Live Updates for India – Using Python".
I am getting this error, after running the code/programe:
performance.append(int(row[2]) + int(row[3]))
ValueError: invalid literal for int() with base 10:
What can I do to fix this problem?
The Code:
extract_contents = lambda row: [x.text.replace('\n', '') for x in row]
URL = 'https://www.mohfw.gov.in/'
SHORT_HEADERS = ['SNo', 'State','Indian-Confirmed',
'Foreign-Confirmed','Cured','Death']
response = requests.get(URL).content
soup = BeautifulSoup(response, 'html.parser')
header = extract_contents(soup.tr.find_all('th'))
stats = []
all_rows = soup.find_all('tr')
for row in all_rows:
stat = extract_contents(row.find_all('td'))
if stat:
if len(stat) == 5:
# last row
stat = ['', *stat]
stats.append(stat)
elif len(stat) == 6:
stats.append(stat)
stats[-1][1] = "Total Cases"
stats.remove(stats[-1])
#Step #3:
objects = []
for row in stats :
objects.append(row[1])
y_pos = np.arange(len(objects))
performance = []
for row in stats:
performance.append(int(row[2]) + int(row[3]))
table = tabulate(stats, headers=SHORT_HEADERS)
print(table)

just changed the line performance.append(int(row[2]) + int(row[3])) to performance.append(row[2] +str(int(float(row[3]))) )
Full Code:
import requests
from bs4 import BeautifulSoup
import numpy as np
from tabulate import tabulate
extract_contents = lambda row: [x.text.replace('\n', '') for x in row]
URL = 'https://www.mohfw.gov.in/'
SHORT_HEADERS = ['SNo', 'State','Indian-Confirmed', 'Foreign-Confirmed','Cured','Death']
response = requests.get(URL).content
soup = BeautifulSoup(response, 'html.parser')
header = extract_contents(soup.tr.find_all('th'))
stats = []
all_rows = soup.find_all('tr')
for row in all_rows:
stat = extract_contents(row.find_all('td'))
if stat:
if len(stat) == 5:
# last row
stat = ['', *stat]
stats.append(stat)
elif len(stat) == 6:
stats.append(stat)
stats[-1][1] = "Total Cases"
stats.remove(stats[-1])
#Step #3:
objects = [row[1] for row in stats ]
y_pos = np.arange(len(objects))
performance = [row[2] +str(int(float(row[3]))) for row in stats]
table = tabulate(stats, headers=SHORT_HEADERS)
print(table)

Related

Plotting multiple lines with a Nested Dictionary, and unknown variables to Line Graph

I was able to find somewhat of an answer to my question, but it was not as nested as my dictionary and so I am really unsure how to proceed as I am still very new to python. I currently have a nested dictionary like
{'140.10': {'46': {'1': '-49.50918', '2': '-50.223637', '3': '49.824406'}, '28': {'1': '-49.50918', '2': '-50.223637', '3': '49.824406'}}}:
I am wanting to plot it so that '140.10' becomes the title of the graph and '46' and '28' become the individual lines and key '1' for example is on the y axis and the x axis is the final number (in this case '-49.50918). Essentially a graph like this:
I generated this graph with a csv file that is written at another part of the code just with excel:
[![enter image description here][2]][2]
The problem I am running into is that these keys are autogenerated from a larger csv file and I will not know their exact value until the code has been run. As each of the keys are autogenerated in an earlier part of the script. As I will be running it over various files called the Graph name, and each file will have a different values for:
{key1:{key2_1: {key3_1: value1, key3_2: value2, key3_3: value3}, key_2_2 ...}}}
I have tried to do something like this:
for filename in os.listdir(Directory):
if filename.endswith('.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in time_an_dict:
atom = list(time_an_dict[s])
ion = time_an_dict[s]
for f in time_an_dict[s]:
x_val = []
y_val = []
fz = ion[f]
for i in time_an_dict[s][f]:
pos = (fz[i])
frame = i
y_val.append(frame)
x_val.append(pos)
'''ions = atom
frame = frames
position = pos
plt.plot(frame, position, label = frames)
plt.xlabel("Frame")
plt.ylabel("Position")
plt.show()
#plt.savefig('{}_Pos.png'.format(s))'''
But it has not run as intended.
I have also tried:
for filename in os.listdir(Directory):
if filename.endswith('_Atom.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in window_dict:
name = s + '_Atom.csv'
time_an_dict[s] = analyze_time(name,window_dict[s])
new = '{}_A_pos.csv'.format(s)
ions = list(time_an_dict.values())[0].keys()
for i in ions:
x_axis_values = []
y_axis_values = []
frame = list(time_an_dict[s][i])
x_axis_values.append(frame)
empty = []
print(x_axis_values)
for x in frame:
values = time_an_dict[s][i][x]
empty.append(values)
y_axis_values.append(empty)
plt.plot(x_axis_values, y_axis_values, label = x )
plt.show()
But keep getting the error:
Traceback (most recent call last): File "Atoms_pos.py", line 175, in
plt.plot(x_axis_values, y_axis_values, label = x ) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/pyplot.py",
line 2840, in plot
return gca().plot( File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_axes.py",
line 1743, in plot
lines = [*self._get_lines(*args, data=data, **kwargs)] File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_base.py",
line 273, in call
yield from self._plot_args(this, kwargs) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/axes/_base.py",
line 394, in _plot_args
self.axes.xaxis.update_units(x) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/axis.py",
line 1466, in update_units
default = self.converter.default_units(data, self) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/category.py",
line 107, in default_units
axis.set_units(UnitData(data)) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/category.py",
line 176, in init
self.update(data) File "/Users/hxb51/opt/anaconda3/lib/python3.8/site-packages/matplotlib/category.py",
line 209, in update
for val in OrderedDict.fromkeys(data): TypeError: unhashable type: 'numpy.ndarray'
Here is the remainder of the other parts of the code that generate the files and dictionaries I am using. I was told in another question I asked that this could be helpful.
# importing dependencies
import math
import sys
import pandas as pd
import MDAnalysis as mda
import os
import numpy as np
import csv
import matplotlib.pyplot as plt
################################################################################
###############################################################################
Directory = '/Users/hxb51/Desktop/Q_prof/Displacement_Charge/Blah'
os.chdir(Directory)
################################################################################
''' We are only looking at the positions of the CLAs and SODs and not the DRUDE counterparts. We are assuming the DRUDE
are very close and it is not something that needs to be concerned with'''
def Positions(dcd, topo):
fields = ['Window', 'ION', 'ResID', 'Location', 'Position', 'Frame', 'Final']
with open('{}_Atoms.csv'.format(s), 'a') as d:
writer = csv.writer(d)
writer.writerow(fields)
d.close()
CLAs = u.select_atoms('segid IONS and name CLA')
SODs = u.select_atoms('segid IONS and name SOD')
CLA_res = len(CLAs)
SOD_res = len(SODs)
frame = 0
for ts in u.trajectory[-10:]:
frame +=1
CLA_pos = CLAs.positions[:,2]
SOD_pos = SODs.positions[:,2]
for i in range(CLA_res):
ids = i + 46
if CLA_pos[i] < 0:
with open('{}_Atoms.csv'.format(s), 'a') as q:
new_line = [s,'CLA', ids, 'Bottom', CLA_pos[i], frame,10]
writes = csv.writer(q)
writes.writerow(new_line)
q.close()
else:
with open('{}_Atoms.csv'.format(s), 'a') as q:
new_line = [s,'CLA', ids, 'Top', CLA_pos[i], frame, 10]
writes = csv.writer(q)
writes.writerow(new_line)
q.close()
for i in range(SOD_res):
ids = i
if SOD_pos[i] < 0:
with open('{}_Atoms.csv'.format(s), 'a') as q:
new_line = [s,'SOD', ids, 'Bottom', SOD_pos[i], frame,10]
writes = csv.writer(q)
writes.writerow(new_line)
q.close()
else:
with open('{}_Atoms.csv'.format(s), 'a') as q:
new_line = [s,'SOD', ids, 'Top', SOD_pos[i], frame, 10]
writes = csv.writer(q)
writes.writerow(new_line)
q.close()
csv_Data = pd.read_csv('{}_Atoms.csv'.format(s))
filename = s + '_Atom.csv'
sorted_df = csv_Data.sort_values(["ION", "ResID", "Frame"],
ascending=[True, True, True])
sorted_df.to_csv(filename, index = False)
os.remove('{}_Atoms.csv'.format(s))
''' this function underneath looks at the ResIds, compares them to make sure they are the same and then counts how many
times the ion flip flops around the boundaries'''
def turn_dict(f):
read = open(f)
reader = csv.reader(read, delimiter=",", quotechar = '"')
my_dict = {}
new_list = []
for row in reader:
new_list.append(row)
for i in range(len(new_list[:])):
prev = i - 1
if new_list[i][2] == new_list[prev][2]:
if new_list[i][3] != new_list[prev][3]:
if new_list[i][2] in my_dict:
my_dict[new_list[i][2]] += 1
else:
my_dict[new_list[i][2]] = 1
return my_dict
def plot_flips(f):
dict = turn_dict(f)
ions = list(dict.keys())
occ = list(dict.values())
plt.bar(range(len(dict)), occ, tick_label = ions)
plt.title("{}".format(s))
plt.xlabel("Residue ID")
plt.ylabel("Boundary Crosses")
plt.savefig('{}_Flip.png'.format(s))
def analyze_time(f, dicts):
read = open(f)
reader = csv.reader(read, delimiter=",", quotechar='"')
new_list = []
keys = list(dicts.keys())
time_dict = {}
pos_matrix = {}
for row in reader:
new_list.append(row)
fields = ['ResID', 'Position', 'Frame']
with open('{}_A_pos.csv'.format(s), 'a') as k:
writer = csv.writer(k)
writer.writerow(fields)
k.close()
for i in range(len(new_list[:])):
if new_list[i][2] in keys:
with open('{}_A_pos.csv'.format(s), 'a') as k:
new_line = [new_list[i][2], new_list[i][4], new_list[i][5]]
writes = csv.writer(k)
writes.writerow(new_line)
k.close()
read = open('{}_A_pos.csv'.format(s))
reader = csv.reader(read, delimiter=",", quotechar='"')
time_list = []
for row in reader:
time_list.append(row)
for j in range(len(keys)):
for i in range(len(time_list[1:])):
if time_list[i][0] == keys[j]:
pos_matrix[time_list[i][2]] = time_list[i][1]
time_dict[keys[j]] = pos_matrix
return time_dict
window_dict = {}
for filename in os.listdir(Directory):
s = filename.split('.dcd')[0]
fors = s + '.txt'
topos = '/Users/hxb51/Desktop/Q_prof/Displacement_Charge/topo.psf'
if filename.endswith('.dcd'):
print('We are starting with {} \n '.format(s))
u = mda.Universe(topos, filename)
Positions(filename, topos)
name = s + '_Atom.csv'
plot_flips(name)
window_dict[s] = turn_dict(name)
continue
time_an_dict = {}
for filename in os.listdir(Directory):
if filename.endswith('.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in window_dict:
name = s + '_Atom.csv'
time_an_dict[s] = analyze_time(name,window_dict[s])
for filename in os.listdir(Directory):
if filename.endswith('.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in time_an_dict:
atom = list(time_an_dict[s])
ion = time_an_dict[s]
for f in time_an_dict[s]:
x_val = []
y_val = []
fz = ion[f]
for i in time_an_dict[s][f]:
pos = (fz[i])
frame = i
y_val.append(frame)
x_val.append(pos)
'''ions = atom
frame = frames
position = pos
plt.plot(frame, position, label = frames)
plt.xlabel("Frame")
plt.ylabel("Position")
plt.show()
#plt.savefig('{}_Pos.png'.format(s))'''
Everything here runs well except this last bottom block of code. That deals with trying to make a graph from a nested dictionary. Any help would be appreciated!
Thanks!
I figured out the answer:
for filename in os.listdir(Directory):
if filename.endswith('_Atom.csv'):
q = filename.split('.csv')[0]
s = q.split('_')[0]
if s in window_dict:
name = s + '_Atom.csv'
time_an_dict[s] = analyze_time(name,window_dict[s])
new = '{}_A_pos.csv'.format(s)
ions = list(time_an_dict[s])
plt.yticks(np.arange(-50, 50, 5))
plt.xlabel('Frame')
plt.ylabel('Z axis position(Ang)')
plt.title([s])
for i in ions:
x_value = []
y_value = []
time_frame =len(time_an_dict[s][i]) +1
for frame in range(1,time_frame):
frame = str(frame)
x_value.append(int(frame))
y_value.append(float(time_an_dict[s][i][frame]))
plt.plot(x_value, y_value, label=[i])
plt.xticks(np.arange(1, 11, 1))
plt.legend()
plt.savefig('{}_Positions.png'.format(s))
plt.clf()
os.remove("{}_A_pos.csv".format(s))
From there, with the combo of the other parts of the code, it produces these graphs:
For more than 1 file as long as there is more '.dcd' files.

Unable to scrape all data

from bs4 import BeautifulSoup
import requests , sys ,os
import pandas as pd
URL = r"https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/"
My_list = ['2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019','2020']
Year= []
CompanyName = []
Rank = []
Score = []
print('\n>>Process started please wait\n\n')
for I, Page in enumerate(My_list, start=1):
url = r'https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/{}'.format(Page)
print('\nData fetching from : ',url)
Res = requests.get(url)
soup = BeautifulSoup(Res.content , 'html.parser')
data = soup.find('section',{'class': 'search-result CompanyWorkfor RankingMain FindSchools school-results contrastSection d-flex justify-content-center min-height Rankings CompRank'})
if len(soup) > 0:
print("\n>>Getting page source for :" , url)
else:
print("Please Check url :",url)
for i, item in enumerate(data.find_all("div", {"class": "RankItem"})):
year = item.find("i",{"class":"fa-stack fa-2x"})
Year.append(year)
title = item.find("h3", {"class": "MainLink"}).get_text().strip()
CompanyName.append(title)
rank = item.find("div", {"class": "RankNumber"}).get_text().strip()
Rank.append(rank)
score = item.find("div", {"class": "score"}).get_text().strip()
Score.append(score)
Data = pd.DataFrame({"Year":Year,"CompanyName":CompanyName,"Rank":Rank,"Score":Score})
Data[['First','Score']] = Data.Score.str.split(" " , expand =True,)
Data[['hash','Rank']] = Data.Rank.str.split("#" , expand = True,)
Data.drop(columns = ['hash','First'],inplace = True)
Data.to_csv('Vault_scrap.csv',index = False)
For each url the expected output Data for year, rank, title and score is 100 lines, but I'm getting only 10 lines.
You can iterate through the year and pages like this.
import requests
import pandas as pd
url = 'https://www.vault.com/vault/api/Rankings/LoadMoreCompanyRanksJSON'
def page_loop(year, url):
tableReturn = pd.DataFrame()
for page in range(1,101):
payload = {
'rank': '2',
'year': year,
'category': 'LBACCompany',
'pg': page}
jsonData = requests.get(url, params=payload).json()
if jsonData == []:
return tableReturn
else:
print ('page: %s' %page)
tableReturn = tableReturn.append(pd.DataFrame(jsonData), sort=True).reset_index(drop=True)
return tableReturn
results = pd.DataFrame()
for year in range(2007,2021):
print ("\n>>Getting page source for :" , year)
jsonData = page_loop(year, url)
results = results.append(pd.DataFrame(jsonData), sort=True).reset_index(drop=True)

lat_long = lat.text.strip('() ').split(',') :AttributeError: 'list' object has no attribute 'text'

Need to find distance between 2 latitude and longitude. The Chrome is being controlled by the driver, then the latitude and longitudes are added to the suitable locations and the it also shows the distance value in the textbox, but it is not able to retrive that generated string of number.Here's the code.Kindly Help.
from selenium import webdriver
import csv
import time
with open('C:/Users/Nisarg.Bhatt/Documents/lats and
longs/Lat_long_cleaned/retail_first.csv', 'r') as f:
reader = csv.reader(f.read().splitlines(), delimiter = ',')
data = [row for row in reader]
filename='C:/Users/Nisarg.Bhatt/Documents/lats and
longs/Lat_long_cleaned/retail_first'
option= webdriver.ChromeOptions()
option.add_argument("-incognito")
path= "Y:/AppData/Local/chromedriver"
browser= webdriver.Chrome(executable_path=path)
url="https://andrew.hedges.name/experiments/haversine/"
browser.get(url)
print(browser.title)
crash = 1
results = []
new=[]
skipped = []
for i,row in enumerate(data[1:]):
print (i)
search = browser.find_element_by_name('lat1')
search_term = data[i+1][5]
search_1=browser.find_element_by_name("lon1")
search_term_1= data[i+1][6]
search_2 = browser.find_element_by_name('lat2')
search_term_2 = data[i+2][5]
search_3 = browser.find_element_by_name('lon2')
search_term_3 = data[i+2][6]
search.clear()
search_1.clear()
search_2.clear()
search_3.clear()
try:
search.send_keys(search_term)
search_1.send_keys(search_term_1)
search_2.send_keys(search_term_2)
search_3.send_keys(search_term_3)
except:
print ('Skiped %s' %search_term)
print (row)
skipped.append(row)
continue
search.submit()
time.sleep(1)
try:
lat = browser.find_elements_by_xpath("/html/body/form/p[4]/input[2]")
except:
alert = browser.switch_to_alert()
alert.accept()
browser.switch_to_default_content()
print ('Couldnt find %s' %search_term)
print (row)
skipped.append(row)
continue
lat_long = lat.text.strip('() ').split(',')
lat_long_clean = [float(n) for n in lat]
try:
browser.refresh()
except:
with open(filename + 'recovered' + '%i' %crash + '.csv' , "wb") as f:
writer = csv.writer(f)
writer.writerows(results)
crash +=1
print (lat_long_clean)
r = row
r.extend(lat_long_clean)
r.insert(0, i)
print (r)
results.append(r)
with open(filename + ".csv", "a") as f:
writer = csv.writer(f)
writer.writerow(r)
with open(filename + "comp.csv" , "wb") as f:
writer = csv.writer(f)
writer.writerows(results)

BeautifulSoup4 Returning Empty List when Attempting to Scrape a Table

I'm trying to pull the data from this url: https://www.winstonslab.com/players/player.php?id=98 and I keep getting the same error when I try to access the tables.
My scraping code is below. I run this, then hp = HTMLTableParser() and table = hp.parse_url('https://www.winstonslab.com/players/player.php?id=98')[0][1] returns the error 'index 0 is out of bounds for axis 0 with size 0'
import requests
import pandas as pd
from bs4 import BeautifulSoup
class HTMLTableParser:
def parse_url(self, url):
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
return [(table['id'],self.parse_html_table(table))\
for table in soup.find_all('table')]
def parse_html_table(self, table):
n_columns = 0
n_rows=0
column_names = []
# Find number of rows and columns
# we also find the column titles if we can
for row in table.find_all('tr'):
# Determine the number of rows in the table
td_tags = row.find_all('td')
if len(td_tags) > 0:
n_rows+=1
if n_columns == 0:
# Set the number of columns for our table
n_columns = len(td_tags)
# Handle column names if we find them
th_tags = row.find_all('th')
if len(th_tags) > 0 and len(column_names) == 0:
for th in th_tags:
column_names.append(th.get_text())
# Safeguard on Column Titles
if len(column_names) > 0 and len(column_names) != n_columns:
raise Exception("Column titles do not match the number of columns")
columns = column_names if len(column_names) > 0 else range(0,n_columns)
df = pd.DataFrame(columns = columns,
index= range(0,n_rows))
row_marker = 0
for row in table.find_all('tr'):
column_marker = 0
columns = row.find_all('td')
for column in columns:
df.iat[row_marker,column_marker] = column.get_text()
column_marker += 1
if len(columns) > 0:
row_marker += 1
# Convert to float if possible
for col in df:
try:
df[col] = df[col].astype(float)
except ValueError:
pass
return df
If the data that you need is just the table, you can accomplish that with pandas.read_html() function.

Python 3.x unsupported operand type in using encode decode

I am trying to build a generic crawler for my marketing project and keep track of where the information came from viz blogs, testimonials etc. I am using Python 3.5 and Spyder/pycharm as IDE and I keep getting the following error in using encode - decode. The input to my code is a list of company names and product features in an excel file. I also searched for possible solutions but the recommendations in the community are for typecasting, which I am not sure is the problem.
Kindly let me know if some more clarification is required from my side.
from __future__ import division, unicode_literals
import codecs
import re
import os
import xlrd
import requests
from urllib.request import urlopen
from time import sleep
from bs4 import BeautifulSoup
import openpyxl
from collections import Counter
page=0
b=0
n=0
w=0
p=0
o=0
workbook=xlrd.open_workbook("C:\Product.xlsx")
workbook1=xlrd.open_workbook("C:\linkslist.xlsx")
sheet_names = workbook.sheet_names()
sheet_names1 = workbook1.sheet_names()
wb= openpyxl.Workbook() #User Spreadsheet
ws = wb.active
ws.title = "User"
ws['A1'] = 'Feature'
ws['B1'] = 'Customer-Testimonials'
ws['C1'] = 'Case Study'
ws['D1'] = 'Blog'
ws['E1'] = 'Press'
ws['F1'] = 'Total posts'
ws1 = wb.create_sheet(title="Ml")
ws1['A1'] = 'Feature'
ws1['B1'] = 'Phrase'
ws1['C1'] = 'Address'
ws1['D1'] = 'Tag Count'
worksheet = workbook.sheet_by_name(sheet_names[0])
worksheet1 = workbook1.sheet_by_name(sheet_names[0])
for linknumber in range(0,25):
u = worksheet1.cell(linknumber,0).value
url='www.' + u.lower() + '.com'
print (url)
r=''
while r == '':
try:
print ("in loop")
r = requests.get("http://" +url)
except:
sleep(3)#if the code still gives that error then try increasing the sleep time to 5 maybe
print (r)
data = r.text
#print data
soup1 = BeautifulSoup(data, "html.parser")
#print soup1
num=3 #starting row number and keep the column same.
word = ''
word = worksheet.cell(num,3).value
while not word == 'end':
print (num)
#print word
tag_list=[]
phrase= []
counts=[]
address=[]
counts = Counter(tag_list)
for link in soup1.find_all('a'):
#print link
add = link.encode("ascii", "ignore")
print (add)
if not'Log In' in add:
#print link.get('href')
i=0
content = ''
for i in range(1,5):
if content=='':
try:
print (link.get('href'))
i+=1
req = urllib.request.Request(link.get('href'))
with urllib.request.urlopen(req) as response:
content = response.read()
except:
sleep(3)
#if the code still gives that error then try increasing the sleep time to 5 maybe
continue
soup = BeautifulSoup(content, "html.parser")
s=soup(text=re.compile(word))
if s:
print ("TRUE")
add = link.encode('ascii','ignore')
print (type(add))
if 'customer-testimonial' in add :
b+=1
elif 'case-study' in add :
n+=1
elif 'blog' in add :
w+=1
elif 'press' in add :
p+=1
else :
o+=1
#phrase_type=["Customer testimonials","news","ads","twitter","facebook","instagram"]
#print(os.path.join(root, name))
print (add)
for tag in s:
parent_html = tag.parent.name
print (parent_html)
tag_list.append(parent_html)
phrase.append(s)
address.append(add)
#print str(phrase)
counts = Counter(tag_list)
page +=1
else:
counts = Counter(tag_list)
no =num-1
print(counts)
print (word)
ws['A%d'%no] = word.encode('utf-8' , 'ignore')
ws1['A%d'%no] = word.encode('utf-8' , 'ignore')
print ("Number of pages is %d" %page)
print ("Number of Customer testimonials posts is %d" %b)
ws['B%d'%no] = b
print ("Number of Case Studies posts is %d" %n)
ws['C%d'%no] = n
print ("Number of blog posts is %d" %w)
ws['D%d'%no] = w
print ("Number of press posts is %d" %p)
ws['E%d'%no] = p
print ("Number of posts is %d" %page)
ws['F%d'%no] = page
ws1['B%d'%no] = phrase.encode('utf-8' , 'ignore')
ws1['C%d'%no] = address.encode('utf-8' , 'ignore')
ws1['D%d'%no] = counts.encode('utf-8' , 'ignore')
counts.clear()
num += 1
word = worksheet.cell(num,3).value
#print word
page=0
b=0
n=0
w=0
p=0
o=0
phrase=[]
address=[]
tag_list=[]
wb.save('%s.xlsx'%worksheet1.cell(linknumber,0).value)
I get the following output and error while running the code:
www.amobee.com
in loop
<Response [200]>
3
Traceback (most recent call last):
File "C:/project_web_parser.py", line 69, in <module>
add = link.encode("ascii", "ignore")
File "C:\ProgramData\Ana3\lib\site-packages\bs4\element.py", line 1094, in encode
u = self.decode(indent_level, encoding, formatter)
File "C:\ProgramData\Ana3\lib\site-packages\bs4\element.py", line 1159, in decode
indent_space = (' ' * (indent_level - 1))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
Process finished with exit code 1
Traceback shows error in line 69 where you try to encode link. To fix it, just change that line to:
add = link.encode("ascii", errors="ignore")
Why does it happen?
Your link variable is type of bs4.element.Tag
>>>type(link)
<class 'bs4.element.Tag'>
.encode() method for tags takes more arguments then .encode() method for strings.
In source code of bs4 in file \bs4\element.py on line 1089 you can find definition of it:
def encode(self, encoding=DEFAULT_OUTPUT_ENCODING,
indent_level=None, formatter="minimal",
errors="xmlcharrefreplace"):
First argument is encoding, second is indent_level (int or None) and errors handling is forth.
Error
unsupported operand type(s) for -: 'str' and 'int'
means that you tried to subtract 'ignore' - 1.

Resources