I want to loop through a number of URL requests, but I'm having an issue: when I run the code it doesn't loop through all the numbers and only requests the page once with the output.
So my question is, how do I loop through the numbers in the URL while requesting each one and capturing all the source code I need? Everything works except the counting.
This is the code I have managed to write so far:
import urllib.request
import re

count = 1
while count <= 44:
    url = ('https://example/page=%i' % count)
    count += 1

req = urllib.request.Request(url)
resp = urllib.request.urlopen(req)
respData = resp.read()

paragraphs = re.findall(r';"><span>(.*?)</span></a></td><td', str(respData))
for eachP in paragraphs:
    print(eachP)
You have a problem in your loop: you iterate over all 44 URLs, but only request the last one. This is a correct implementation:
import urllib.request
import re

def handle_url(url):
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    paragraphs = re.findall(r';"><span>(.*?)</span></a></td><td', str(respData))
    for eachP in paragraphs:
        print(eachP)

count = 1
while count <= 44:
    handle_url('https://example/page=%i' % count)
    count += 1
If you don't like the function, you can simply move everything into the while loop:
import urllib.request
import re

count = 1
while count <= 44:
    url = 'https://example/page=%i' % count
    count += 1

    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    paragraphs = re.findall(r';"><span>(.*?)</span></a></td><td', str(respData))
    for eachP in paragraphs:
        print(eachP)
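As a side note, since the page numbers are a fixed range, a for loop over range(1, 45) is usually a bit cleaner than a manual counter. A minimal sketch of that variant, using the same placeholder URL and scraping logic as above:

import urllib.request
import re

# Same scraping logic, but range() handles the counting for us.
for count in range(1, 45):
    url = 'https://example/page=%i' % count
    resp = urllib.request.urlopen(urllib.request.Request(url))
    respData = resp.read()

    paragraphs = re.findall(r';"><span>(.*?)</span></a></td><td', str(respData))
    for eachP in paragraphs:
        print(eachP)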
The code works well when I hardcode the nodes (e.g. node1), but not when I use user input - it always returns 0 instead of summing the numbers under "node3". Here is the page I am using: http://py4e-data.dr-chuck.net/comments_678016.xml - node1 = comments, node2 = comment, node3 = count. Any suggestions?
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input("Which url?\n")
node1 = input("Enter node1- ")
node2 = input("Enter node2- ")
node3 = input("Enter node3- ")
count = 0

try:
    html = urllib.request.urlopen(url, context=ctx).read()
    tree = ET.fromstring(html)
    x = tree.findall(node1/node2)
    for item in x:
        c = int(item.find(node3).text)
        count = count + c
    print(count)
except:
    print("Please only input complete urls")
Putting aside the user-input angle, if you want to sum up all numbers under "count", change your XPath expression to
x = tree.findall('.//comment/count')
and then either do it the long way (which I personally prefer):
total = 0
for count in x:
    total += int(count.text)
or use a list comprehension:
sum([int(count.text) for count in x])
In either case, the output is
2348
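Putting it together with the XML file from the question, a minimal self-contained sketch (no SSL context is needed for this plain-HTTP URL):

import urllib.request
import xml.etree.ElementTree as ET

url = 'http://py4e-data.dr-chuck.net/comments_678016.xml'
tree = ET.fromstring(urllib.request.urlopen(url).read())
counts = tree.findall('.//comment/count')
print(sum(int(c.text) for c in counts))  # prints 2348 for this particular file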
Found out what the mistake was - needed to concatenate strings:
x = tree.findall(node1 + "/" + node2)
import urllib.request, urllib.parse, urllib.error
import xml.etree.ElementTree as ET
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input("Which url?\n")
node1 = input("Enter node1- ")
node2 = input("Enter node2- ")
node3 = input("Enter node3- ")
count = 0

try:
    html = urllib.request.urlopen(url, context=ctx).read()
    tree = ET.fromstring(html)
    x = tree.findall(node1 + "/" + node2)
    for item in x:
        count += int(item.find(node3).text)
    print(count)
except:
    print("Please only input complete urls")
I have a method that iterates through a load of values and sends a request for each one. This works fine, however it takes a LONG time. I want to speed this up using multiprocessing in Python 3.
My question is: what is the correct way to implement this with multiprocessing so I can speed up these requests?
The existing method is as follows:
def change_password(id, prefix_length):
    count = 0
    # e = None
    # m = 0
    for word in map(''.join, itertools.product(string.ascii_lowercase, repeat=int(prefix_length))):
        # code = f'{word}{id}'
        h = hash_string(word)[0:10]
        url = f'http://atutor/ATutor/confirm.php?id=1&m={h}&member_id=1&auto_login=1'
        print(f"Issuing reset request with: {h}")
        proxies = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}
        s = requests.Session()
        res = s.get(url, proxies=proxies, headers={"Accept": "*/*"})
        if res.status_code == 302:
            return True, s.cookies, count
        else:
            count += 1
    return False, None, count
I have tried to convert it based on the first example in the Python 3 docs here, however it just seems to hang when I start the script. No doubt I have implemented it wrong:
def send_request(word):
    # code = f'{word}{id}'
    h = hash_string(word)[0:10]
    url = f'http://atutor/ATutor/confirm.php?id=1&m={h}&member_id=1&auto_login=1'
    print(f"Issuing reset request with: {h}")
    proxies = {'http': 'http://127.0.0.1:8080', 'https': 'http://127.0.0.1:8080'}
    s = requests.Session()
    res = s.get(url, proxies=proxies, headers={"Accept": "*/*"})
    if res.status_code == 302:
        return True, word
    else:
        return False
It is being called from a modified version of the original method, like so:
def change_password(id, prefix_length):
    count = 0
    # e = None
    # m = 0
    words = map(''.join, itertools.product(string.ascii_lowercase, repeat=int(prefix_length)))
    with Pool(10) as pool:
        results, word = pool.map(send_request, words)
        if results:
            return True, word
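For reference, pool.map returns a list with one result per input word rather than a single (results, word) pair, so the results are normally consumed by iterating over them, or with imap_unordered to process them as workers finish. A minimal sketch of that pattern (not necessarily the cause of the hang), assuming send_request and hash_string are defined in the same module as above:

import itertools
import string
from multiprocessing import Pool

def change_password(id, prefix_length):
    words = map(''.join, itertools.product(string.ascii_lowercase, repeat=int(prefix_length)))
    with Pool(10) as pool:
        # imap_unordered yields each result ((True, word) or False) as soon as a worker finishes.
        for result in pool.imap_unordered(send_request, words):
            if result:
                return True, result[1]
    return False, None

if __name__ == '__main__':
    # Worker functions must be importable from the main module; the guard is
    # required on platforms that spawn worker processes (e.g. Windows).
    print(change_password(1, 1))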
I am getting the below error when downloading files using multiprocessing. I am downloading Wikipedia page views, and they are organised by hour, so it might involve a lot of downloading.
Any recommendation as to why this error is caused and HOW TO SOLVE IT? Thanks
MaybeEncodingError: Error sending result: ''. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object",)'
import fnmatch
import requests
import urllib.request
from bs4 import BeautifulSoup
import multiprocessing as mp

def download_it(download_file):
    global path_to_save_document
    filename = download_file[download_file.rfind("/")+1:]
    save_file_w_submission_path = path_to_save_document + filename
    request = urllib.request.Request(download_file)
    response = urllib.request.urlopen(request)
    data_content = response.read()
    with open(save_file_w_submission_path, 'wb') as wf:
        wf.write(data_content)
    print(save_file_w_submission_path)

pattern = r'*200801*'
url_to_download = r'https://dumps.wikimedia.org/other/pagecounts-raw/'
path_to_save_document = r'D:\Users\Jonathan\Desktop\Wikipedia\\'

def main():
    global pattern
    global url_to_download
    r = requests.get(url_to_download)
    data = r.text
    soup = BeautifulSoup(data, features="lxml")

    list_of_href_year = []
    for i in range(2):
        if i == 0:
            for link in soup.find_all('a'):
                lien = link.get('href')
                if len(lien) == 4:
                    list_of_href_year.append(url_to_download + lien + '/')
        elif i == 1:
            list_of_href_months = []
            list_of_href_pageviews = []
            for loh in list_of_href_year:
                r = requests.get(loh)
                data = r.text
                soup = BeautifulSoup(data, features="lxml")
                for link in soup.find_all('a'):
                    lien = link.get('href')
                    if len(lien) == 7:
                        list_of_href_months.append(loh + lien + '/')
            if not list_of_href_months:
                continue
            for lohp in list_of_href_months:
                r = requests.get(lohp)
                data = r.text
                soup = BeautifulSoup(data, features="lxml")
                for link in soup.find_all('a'):
                    lien = link.get('href')
                    if "pagecounts" in lien:
                        list_of_href_pageviews.append(lohp + lien)

    matching_list_of_href = fnmatch.filter(list_of_href_pageviews, pattern)
    matching_list_of_href.sort()
    with mp.Pool(mp.cpu_count()) as p:
        print(p.map(download_it, matching_list_of_href))

if __name__ == '__main__':
    main()
As Darkonaut proposed, I used multithreading instead.
Example:
from multiprocessing.dummy import Pool as ThreadPool

'''This function is used to download the files using multithreading'''
def multithread_download_files_func(self, download_file):
    try:
        filename = download_file[download_file.rfind("/")+1:]
        save_file_w_submission_path = self.ptsf + filename
        '''Check if the download doesn't already exist. If not, proceed, otherwise skip'''
        if not os.path.exists(save_file_w_submission_path):
            data_content = None
            try:
                '''Let's download the file'''
                request = urllib.request.Request(download_file)
                response = urllib.request.urlopen(request)
                data_content = response.read()
            except urllib.error.HTTPError:
                '''We will do a retry on the download if the server is temporarily unavailable'''
                retries = 1
                success = False
                while not success:
                    try:
                        '''Make another request if the previous one failed'''
                        response = urllib.request.urlopen(download_file)
                        data_content = response.read()
                        success = True
                    except Exception:
                        '''We will make the program wait a bit before sending another request to download the file'''
                        wait = retries * 5
                        time.sleep(wait)
                        retries += 1
            except Exception as e:
                print(str(e))
            '''If the response data is not empty, we will write it as a new file stored in the data lake folder'''
            if data_content:
                with open(save_file_w_submission_path, 'wb') as wf:
                    wf.write(data_content)
                print(self.present_extract_RC_from_RS + filename)
    except Exception as e:
        print('funct multithread_download_files_func' + str(e))

'''This function is used as a wrapper before using multithreading in order to download the files to be stored in the Data Lake'''
def download_files(self, filter_files, url_to_download, path_to_save_file):
    try:
        self.ptsf = path_to_save_file = path_to_save_file + 'Step 1 - Data Lake\Wikipedia Pagecounts\\'
        filter_files_df = filter_files
        self.filter_pattern = filter_files
        self.present_extract_RC_from_RS = 'WK Downloaded-> '
        if filter_files_df == '*':
            '''We will create a string of all the years concatenated together for later use in this program'''
            reddit_years = [2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018]
            filter_files_df = ''
            '''Go through the years from 2005 to 2018'''
            for idx, ry in enumerate(reddit_years):
                filter_files_df += '*' + str(ry) + '*'
                if (idx != len(reddit_years)-1):
                    filter_files_df += '&'
        download_filter = list([x.strip() for x in filter_files_df.split('&')])
        download_filter.sort()
        '''If the folder doesn't exist, create one'''
        if not os.path.exists(os.path.dirname(self.ptsf)):
            os.makedirs(os.path.dirname(self.ptsf))
        '''We will get the website HTML elements using the beautifulsoup library'''
        r = requests.get(url_to_download)
        data = r.text
        soup = BeautifulSoup(data, features="lxml")
        list_of_href_year = []
        for i in range(2):
            if i == 0:
                '''Let's get all href available on this particular page. The first page is the year page'''
                for link0 in soup.find_all('a'):
                    lien0 = link0.get('href')
                    '''We will check if the length is 4, which corresponds to a year'''
                    if len(lien0) == 4:
                        list_of_href_year.append(url_to_download + lien0 + '/')
            elif i == 1:
                list_of_href_months = []
                list_of_href_pageviews = []
                for loh in list_of_href_year:
                    r1 = requests.get(loh)
                    data1 = r1.text
                    '''Get the webpage HTML tags'''
                    soup1 = BeautifulSoup(data1, features="lxml")
                    for link1 in soup1.find_all('a'):
                        lien1 = link1.get('href')
                        '''We will check if the length is 7, which corresponds to the year and month'''
                        if len(lien1) == 7:
                            list_of_href_months.append(loh + lien1 + '/')
                for lohm in list_of_href_months:
                    r2 = requests.get(lohm)
                    data2 = r2.text
                    '''Get the webpage HTML tags'''
                    soup2 = BeautifulSoup(data2, features="lxml")
                    for link2 in soup2.find_all('a'):
                        lien2 = link2.get('href')
                        '''We will now get all href that contain pagecounts in their name. The files are per hour, so 24 hours is 24 files
                        and per year that is at least 24*365 = 8760 files'''
                        if "pagecounts" in lien2:
                            list_of_href_pageviews.append(lohm + lien2)
        existing_file_list = []
        for file in os.listdir(self.ptsf):
            filename = os.fsdecode(file)
            existing_file_list.append(filename)
        '''Filter the links'''
        matching_fnmatch_list = []
        if filter_files != '':
            for dfilter in download_filter:
                fnmatch_list = fnmatch.filter(list_of_href_pageviews, dfilter)
                i = 0
                for fnl in fnmatch_list:
                    '''Break for demo purposes only'''
                    if self.limit_record != 0:
                        if (i == self.limit_record) and (i != 0):
                            break
                    i += 1
                    matching_fnmatch_list.append(fnl)
        '''If the user stated a filter, we will try to remove the files which are outside that filter from the list'''
        to_remove = []
        for efl in existing_file_list:
            for mloh in matching_fnmatch_list:
                if efl in mloh:
                    to_remove.append(mloh)
        '''Let's remove the files which have been found outside the filter'''
        for tr in to_remove:
            matching_fnmatch_list.remove(tr)
        matching_fnmatch_list.sort()
        '''Multithreading with 200 threads'''
        p = ThreadPool(200)
        p.map(self.multithread_download_files_func, matching_fnmatch_list)
    except Exception as e:
        print('funct download_files' + str(e))
From the accepted answer, I understood that it is simply a matter of replacing from multiprocessing import Pool with from multiprocessing.dummy import Pool.
This worked for me.
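For illustration, a minimal sketch of what that swap looks like around the question's download_it function (download_many is just a hypothetical wrapper name):

from multiprocessing.dummy import Pool  # thread pool with the same API as multiprocessing.Pool
# from multiprocessing import Pool      # the process-based version used originally

def download_many(urls):
    # multiprocessing.dummy uses threads, so map results are not pickled
    # and sent between processes.
    with Pool(8) as pool:
        pool.map(download_it, urls)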
I have a website that contains a table, and I need to scrape that table. The email addresses in the table can only be seen when you open them in a new tab, but they are present in the HTML source of the page. I am unable to scrape the emails.
import requests
import pandas as pd
from bs4 import BeautifulSoup

class HTMLTableParser:

    def parse_url(self, url):
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        return [(table['id'], self.parse_html_table(table))
                for table in soup.find_all('table')]

    def parse_html_table(self, table):
        n_columns = 0
        n_rows = 0
        column_names = []

        for row in table.find_all('tr'):
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows += 1
                if n_columns == 0:
                    n_columns = len(td_tags)
            th_tags = row.find_all('th')
            if len(th_tags) > 0 and len(column_names) == 0:
                for th in th_tags:
                    column_names.append(th.get_text())

        if len(column_names) > 0 and len(column_names) != n_columns:
            raise Exception("Column titles do not match the number of columns")

        columns = column_names if len(column_names) > 0 else range(0, n_columns)
        df = pd.DataFrame(columns=columns, index=range(0, n_rows))

        row_marker = 0
        for row in table.find_all('tr'):
            column_marker = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_marker, column_marker] = column.get_text()
                column_marker += 1
            if len(columns) > 0:
                row_marker += 1

        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass

        return df
EDIT: Originally I only answered how to get at just the email. Answer adjusted to get the email with all the other data.
EDIT 2: Compatible with BS4 4.6 series.
The email is not captured because it lives in the href of an anchor. We extract the email from the anchor if one is present, and otherwise fall back to the text of the cell.
As it isn't 100% clear to me what the code's end goal is, this only extracts all the cells, with an emphasis on acquiring the email that was not captured by the original code.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.adelaide.edu.au/directory/atoz?dsn=directory.phonebook;orderby=last%2Cfirst%2Cposition_n;m=atoz;page=;perpage=50'

def parse_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    return [(table['id'], parse_html_table(table)) for table in soup.find_all('table')]

def parse_html_table(table):
    n_columns = 0
    n_rows = 0
    column_names = []

    column_names = [th.get_text() for th in table.select('th')]
    n_columns = len(column_names)

    rows = table.select('tr')[1:]
    n_rows = len(rows)

    df = pd.DataFrame(columns=column_names, index=range(n_rows))

    r_index = 0
    for row in rows:
        c_index = 0
        for cell in row.select('td'):
            if cell.get('data-th') == 'Email':
                anchor = cell.select_one('a')
                df.iat[r_index, c_index] = anchor.get('href').replace('mailto:', '') if anchor else cell.get_text()
            else:
                df.iat[r_index, c_index] = cell.get_text()
            c_index += 1
        r_index += 1

    return df

print(parse_url(url))
from geopy.geocoders import Nominatim
import openpyxl

wb = openpyxl.load_workbook('#######.xlsx')
ws = wb.active
geolocator = Nominatim(timeout=60)

for i in range(2, 1810):
    count1 = 0
    count2 = 1
    address = str(ws['B'+str(i)].value)
    city = str(ws['C'+str(i)].value)
    state = str(ws['D'+str(i)].value)
    zipc = str(ws['F'+str(i)].value)
    result = None
    iden1 = address + ' ' + city + ' ' + state
    iden2 = city + ' ' + zipc + ' ' + state
    iden3 = city + ' ' + state
    print(iden1, iden2, iden3)
    print(geolocator.geocode(iden2).address)
    try:
        location1 = geolocator.geocode(iden1)
    except:
        pass
    try:
        location2 = geolocator.geocode(iden2)
    except:
        pass
    try:
        location3 = geolocator.geocode(iden3)
    except:
        pass
    count = None
    try:
        county1 = str(location1.address)
        county1_list = county1.split(", ")
        #print(county1_list)
        for q in county1_list:
            if 'county' in q.lower():
                if count == None:
                    count = q
    except:
        pass
    try:
        county2 = str(location2.address)
        county2_list = county2.split(", ")
        #print(county2_list)
        for z in county2_list:
            if 'county' in z.lower():
                if count == None:
                    count = z
    except:
        pass
    try:
        county3 = str(location3.address)
        county3_list = county3.split(", ")
        #print(county3_list)
        for j in county3_list:
            if 'county' in j.lower():
                if count == None:
                    count = j
    except:
        pass
    print(i, count)
    #ws['E'+str(i)] = count
    if count == 50:
        #wb.save("#####" +str(count2) +".xlsx")
        count2 += 1
        count1 = 0
Hello all, this code is pretty simple and uses geopy to extract county names using three different query strings, iden1, iden2, and iden3, which are combinations of address, city, state, and zip code. It ran fine for about 300 rows but then began to repeat the same county, and after restarting the script it just spat out Nones. I put in the line print(geolocator.geocode(iden2).address) to find the error and got this error message:
Traceback (most recent call last):
  File "C:/Users/#####/Downloads/Web content/#####/####_county.py", line 19, in <module>
    print(geolocator.geocode(iden2).address)
  File "C:\Users#####\AppData\Local\Programs\Python\Python36-32\lib\site-packages\geopy\geocoders\osm.py", line 193, in geocode
    self._call_geocoder(url, timeout=timeout), exactly_one
  File "C:\Users#####\AppData\Local\Programs\Python\Python36-32\lib\site-packages\geopy\geocoders\base.py", line 171, in _call_geocoder
    raise GeocoderServiceError(message)
geopy.exc.GeocoderServiceError: [WinError 10061] No connection could be made because the target machine actively refused it
This script was working before but now does not. Is my IP being blocked from using geopy's database or something? Thanks for your help!
It looks like you're hitting their rate limiting. It seems that they ask that you limit your API requests to 1/second. You can take a look here for their usage policy, where they list alternatives to using their API as well as constraints.
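If you stay with Nominatim, one common way to respect the 1 request/second limit is geopy's RateLimiter wrapper (available in recent geopy versions). A minimal sketch, where the user_agent string and the query are just placeholders:

from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

geolocator = Nominatim(user_agent="county-lookup-script", timeout=60)
# Wrap geocode so successive calls are spaced at least 1 second apart.
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

location = geocode("some city, state, zip")  # e.g. the iden2 string from your loop
if location is not None:
    print(location.address)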