As far as I can tell, my code works absolutely fine, though it probably looks a bit rudimentary and crude to more experienced eyes.
Objective:
Create a 'filter' that loops through a (large) range of possible ID numbers. Each ID should be used to attempt a log-in at the url website. If the ID is valid, it should be saved to hit_list.
Issue:
In large loops, the programme 'hangs' for indefinite periods of time. Although I have no evidence (no exception is thrown), I suspect this is a timeout issue (or rather, would be one if a timeout were specified).
Question:
I want to add a timeout, and then handle the timeout exception so that my programme will stop hanging. If this theory is wrong, I would also like to hear what my issue might be.
How to add a timeout is a question that has been asked before, here and here, but after spending all weekend working on this, I'm still at a loss. Put bluntly, I don't understand those answers.
What I've tried:
Create a try/except block in the id_filter function. The try is at r = s.get(url) and the except is at the end of the function. I've read the requests docs in detail, here and here. This didn't work.
The more I read about futures, the more I'm convinced that excepting errors has to be done in futures, rather than in requests (as I did above). So I tried inserting a timeout in the brackets after boss.map, but as far as I could tell, this had no effect; it seems too simple anyway.
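Roughly, the requests-level attempt looked like this inside id_filter (the 10-second value is just an illustration, and s.post would need the same treatment); as I understand it, requests has no default timeout, so without one a socket could in principle block forever:

try:
    r = s.get(url, timeout=10)
except requests.exceptions.Timeout:
    print('request timed out')
    return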
So, to reiterate:
For large loops (50,000+) my programme tends to hang for an indefinite period of time (there is no exact point when this starts, though it's usually after 90% of the loop has been processed). I don't know why, but I suspect adding a timeout would throw an exception, which I could then except. This theory may, however, be wrong. I have tried to add a timeout and handle other errors in the requests part, but to no effect.
-Python 3.5
My code:
import concurrent.futures as cf
import time

import requests
from bs4 import BeautifulSoup

hit_list = []
processed_list = []
start_range = 100050000
end_range = 100150000
loop_size = range(start_range, end_range)
workers = 70
chunks = 300
url = 'https://ndber.seai.ie/pass/ber/search.aspx'

def id_filter(_range):
    with requests.session() as s:
        s.headers.update({
            'user-agent': 'FOR MORE INFORMATION ABOUT THIS DATA COLLECTION PLEASE CONTACT ########'
        })
        # fetch the search page once per chunk to pick up the ASP.NET form state
        r = s.get(url)
        time.sleep(.1)
        soup = BeautifulSoup(r.content, 'html.parser')
        viewstate = soup.find('input', {'name': '__VIEWSTATE'}).get('value')
        viewstategen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value')
        validation = soup.find('input', {'name': '__EVENTVALIDATION'}).get('value')
        for ber in _range:
            data = {
                'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
                'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
                '__VIEWSTATE': viewstate,
                '__VIEWSTATEGENERATOR': viewstategen,
                '__EVENTVALIDATION': validation,
            }
            y = s.post(url, data=data)
            if 'No results found' in y.text:
                pass  # print('Invalid ID', ber)
            else:
                hit_list.append(ber)
                print('Valid ID', ber)

if __name__ == '__main__':
    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        jobs = [loop_size[x: x + chunks] for x in range(0, len(loop_size), chunks)]
        boss.map(id_filter, jobs)
        # record data below
Related
Currently, I'm trying to improve code that sends multiple HTTP requests to a webpage until it can capture some text (which the code locates through a known pattern) or until 180 seconds run out (the time we wait for the page to give us an answer).
This is the part of the code (a little edited for privacy purposes):
if matches is None:
    txt = "No answer til now"
    print(txt)
    Solution = False
    start = time.time()
    interval = 0
    while interval < 180:
        response = requests.get("page address")
        subject = response.text
        matches = re.search(pattern, subject, re.IGNORECASE)
        if matches is not None:
            Solution = matches.group(1)
            elapsed = "{:.2f}".format(time.time() - start)  # renamed so the time module isn't shadowed
            txt = "Found an answer " + Solution + ", time needed: " + elapsed
            print(txt)
            break
        interval = time.time() - start
else:
    Solution = matches.group(1)
It runs OK, but I was told that making "infinite requests in a loop" could cause high CPU usage on the server. Do you guys know of something I can use in order to avoid that?
PS: I heard that in PHP people use curl_multi_select() for things like these. I don't know if I'm correct, though.
Usually an HTTP REST API will specify in the documentation how many requests you can make in a given time period against which endpoint resources.
For a website, if you are not hitting a request limit and getting flagged/banned for too many requests, then you should be okay to continuously loop like this, but you may want to introduce a time.sleep call into your while loop.
An alternative to the 180 second timeout:
Since HTTP requests are I/O operations and can take a variable amount of time, you may want to change your exit case for the loop to a certain number of requests (like 25 or something) and then incorporate the aforementioned sleep call.
That could look like:
# ...
if matches is None:
    solution = None
    num_requests = 25
    start = time.time()
    while num_requests:
        response = requests.get("page address")
        if response.ok:  # It's good to attempt to handle potential HTTP/connectivity errors
            subject = response.text
            matches = re.search(pattern, subject, re.IGNORECASE)
            if matches:
                solution = matches.group(1)
                elapsed = "{:.2f}".format(time.time() - start)
                txt = "Found an answer " + solution + ", time needed: " + elapsed
                print(txt)
                break
        else:
            # Maybe raise an error here?
            pass
        time.sleep(2)
        num_requests -= 1
else:
    solution = matches.group(1)
Notes:
Regarding PHP's curl_multi_select - (NOT a PHP expert here...) it seems that this function is designed to allow you to watch multiple connections to different URLs in an asynchronous manner. Async doesn't really apply to your use case here because you are only scraping one webpage (URL), and are just waiting for some data to appear there.
If the response.text you are searching through is HTML and you aren't already using it somewhere else in your code, I would recommend trying Beautiful Soup or Scrapy (before regex) for searching for string patterns in webpage markup.
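For instance, a minimal Beautiful Soup sketch (the tag name and class below are placeholders, since the actual page structure isn't shown):

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
# hypothetical selector; replace with whatever element actually holds the answer
answer_node = soup.find('div', class_='answer')
if answer_node is not None:
    solution = answer_node.get_text(strip=True)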
I have built a web scraper using bs4, where the purpose is to get notifications when a new announcement is posted. At the moment I am testing this with the word 'list' instead of all the announcement keywords. For some reason, when I compare the time at which it determines a new announcement has been posted with the actual time it was posted on the website, the time is off by 5 minutes, give or take.
from bs4 import BeautifulSoup
from requests import get
import time
import sys

x = True
while x == True:
    time.sleep(30)
    # Data for the scraping
    url = "https://www.binance.com/en/support/announcement"
    response = get(url)
    html_page = response.content
    soup = BeautifulSoup(html_page, 'html.parser')
    news_list = soup.find_all(class_ = 'css-qinc3w')
    # Create a bag of key words for getting matches
    key_words = ['list', 'token sale', 'open trading', 'opens trading', 'perpetual', 'defi', 'uniswap', 'airdrop', 'adds', 'updates', 'enabled', 'trade', 'support']
    # Empty list
    updated_list = []
    for news in news_list:
        article_text = news.text
        if ("list" in article_text.lower()):
            updated_list.append([article_text])
    if len(updated_list) > 4:
        print(time.asctime( time.localtime(time.time()) ))
        print(article_text)
        sys.exit()
When the length of the list increased by 1 (to 5), the response was to print the following time and new announcement:
Fri May 28 04:17:39 2021,
Binance Will List Livepeer (LPT)
I am unsure why this is. At first I thought I was being throttled, but looking again at robots.txt, I didn't see any reason why I should be. Moreover, I included a sleep time of 30 seconds, which should be more than enough to web scrape without any issues. Any help or an alternative solution would be much appreciated.
My question is:
Why is it 5 minutes behind? Why does it not notify me as soon as the website posts it? The program takes 5 minutes longer to recognize there is a new post compared to the time it is posted on the website.
from xrzz import http  # give it a try using my simple scratch module
import json

url = "https://www.binance.com/bapi/composite/v1/public/cms/article/list/query?type=1&pageNo=1&pageSize=30"
req = http("GET", url=url, tls=True).body().decode()

key_words = ['list', 'token sale', 'open trading', 'opens trading', 'perpetual', 'defi', 'uniswap', 'airdrop', 'adds', 'updates', 'enabled', 'trade', 'support']
for i in json.loads(req)['data']['catalogs']:
    for o in i['articles']:
        if key_words[0] in o['title']:
            print(o['title'])
Output:
I think the problem is that the Cloudflare server is caching documents, or that it was done deliberately by the Binance programmers so that a narrow circle of people could react to the news faster than everyone else.
This is a big problem if you want to get fresh data. If you look at the HTTP headers, you will notice that the Date: header is cached by the server, which means that the entire content of the document is cached. I managed to get two different Date: values by adding or removing the gzip header:
accept-encoding: gzip, deflate.
I am using the page
https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15
If you change the pageSize parameter, you can get fresh cached responses from the server, but that still doesn't solve the 5-minute delay issue, and I still see the old page.
Your link,
https://www.binance.com/bapi/composite/v1/public/cms/article/list/query?type=1&pageNo=1&pageSize=30, like mine, https://www.binance.com/bapi/composite/v1/public/cms/article/catalog/list/query?catalogId=48&pageNo=1&pageSize=15,
is also cached for 5 seconds. My guess is that there will be a 5-minute delay there as well. I have not found a solution to this problem.
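For reference, a minimal sketch of how one might compare the two Date: values mentioned above (using requests here, which is an assumption since this answer doesn't show its code):

import requests

url = ("https://www.binance.com/bapi/composite/v1/public/cms/article/"
       "catalog/list/query?catalogId=48&pageNo=1&pageSize=15")

# one request that accepts gzip, one that asks for an uncompressed response
with_gzip = requests.get(url, headers={'Accept-Encoding': 'gzip, deflate'})
without_gzip = requests.get(url, headers={'Accept-Encoding': 'identity'})

print('Date with gzip:   ', with_gzip.headers.get('Date'))
print('Date without gzip:', without_gzip.headers.get('Date'))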
I'm trying to understand if it's possible to set a loop inside of a try/except call, or if I'd need to restructure to use functions. Long story short, after spending a few hours learning Python and BeautifulSoup, I managed to frankenstein some code together to scrape a list of URLs, pull that data out to CSV (and now update it to a MySQL db). The code is now working as planned, except that I occasionally run into a 10054 error (connection reset), either because my VPN hiccups, or possibly because the source host server is occasionally bouncing me (I have a 30-second delay in my loop but it still kicks me out on occasion).
I get the general idea of Try/Except structure, but I'm not quite sure how I would (or if I could) loop inside it to try again. My base code to grab the URL, clean it and parse the table I need looks like this:
for url in contents:
    print('Processing record', (num+1), 'of', len(contents))
    if url:
        print('Retrieving data from ', url[0])
        html = requests.get(url[0]).text
        soup = BeautifulSoup(html, 'html.parser')
        for span in soup('span'):
            span.decompose()
        trs = soup.select('div#collapseOne tr')
        if trs:
            print('Processing')
            for t in trs:
                for header, value in zip(t.select('td')[0], t.select('td:nth-child(2)')):
                    if num == 0:
                        headers.append(' '.join(header.split()))
                    values.append(re.sub(' +', ' ', value.get_text(' ', strip=True)))
After that is just processing the data to CSV and running an update sql statement.
What I'd like to do is: if the HTML request call fails, wait 30 seconds and try the request again, then process; or, if the retry fails X number of times, go ahead and exit the script (assuming at that point I have a full connection failure).
Is it possible to do something like that in line, or would I need to make the request statement into a function and set up a loop to call it? Have to admit I'm not familiar with how Python works with function returns yet.
You can add an inner loop for the retries and put your try/except block in that. Here is a sketch of what it would look like. You could put all of this into a function and put that function call in its own try/except block to catch other errors that cause the loop to exit.
Looking at the requests exception hierarchy, Timeout covers multiple recoverable exceptions and is a good start for everything you may want to catch. Other things like SSLError aren't going to get better just because you retry, so skip them. You can go through the list to see what is reasonable for you.
import itertools
import time

import requests

# requests exceptions at
# https://requests.readthedocs.io/en/master/_modules/requests/exceptions/

for url in contents:
    print('Processing record', (num+1), 'of', len(contents))
    if url:
        print('Retrieving data from ', url[0])
        retry_count = itertools.count()
        # loop for retries
        while True:
            try:
                # get with timeout and convert http errors to exceptions
                resp = requests.get(url[0], timeout=10)
                resp.raise_for_status()
            # the things you want to recover from
            except requests.Timeout as e:
                if next(retry_count) <= 5:
                    print("timeout, wait and retry:", e)
                    time.sleep(30)
                    continue
                else:
                    print("timeout, exiting")
                    raise  # reraise exception to exit
            except Exception as e:
                print("unrecoverable error", e)
                raise
            break
        html = resp.text
        # etc…
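As mentioned above, the retry loop could also be pulled into its own function; a rough sketch of that (the function name and default values are illustrative):

def get_with_retries(target_url, retries=5, wait=30, timeout=10):
    """Fetch a URL, retrying on timeouts; re-raises anything unrecoverable."""
    for attempt in range(retries):
        try:
            resp = requests.get(target_url, timeout=timeout)
            resp.raise_for_status()
            return resp
        except requests.Timeout as e:
            print("timeout, wait and retry:", e)
            time.sleep(wait)
    raise requests.Timeout("still timing out after {} attempts".format(retries))

# usage inside the main loop:
# resp = get_with_retries(url[0])
# html = resp.text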
I've made a little example myself to illustrate this, and yes, you can put loops inside try/except blocks.
from sys import exit

def example_func():
    try:
        while True:
            num = input("> ")
            try:
                int(num)
                if num == "10":
                    print("Let's go!")
                else:
                    print("Not 10")
            except ValueError:
                exit(0)
    except:
        exit(0)

example_func()
This is a fairly simple program that takes input, and if it's 10, it says "Let's go!"; otherwise it tells you it's not 10 (and if it's not a valid value, it just kicks you out).
Notice that inside the while loop I put a try/except block, taking into account the necessary indentation. You can take this program as a model and use it in your favor.
I have the following question: I want to set up a routine that iterates over a pandas dataframe and extracts longitude and latitude data after supplying the address, using the geopy library.
The routine I created was:
import os
import time

import pandas as pd
from geopy.geocoders import GoogleV3

arquivo = pd.ExcelFile('path')
df = arquivo.parse("Table1")

def set_proxy():
    proxy_addr = 'http://{user}:{passwd}@{address}:{port}'.format(
        user='usuario', passwd='senha',
        address='IP', port=int('PORTA'))
    os.environ['http_proxy'] = proxy_addr
    os.environ['https_proxy'] = proxy_addr

def unset_proxy():
    os.environ.pop('http_proxy')
    os.environ.pop('https_proxy')

set_proxy()
geo_keys = ['AIzaSyBXkATWIrQyNX6T-VRa2gRmC9dJRoqzss0']  # API Google
geolocator = GoogleV3(api_key=geo_keys)

for index, row in df.iterrows():
    location = geolocator.geocode(row['NO_LOGRADOURO'], timeout=10)
    time.sleep(2)
    lat = location.latitude
    lon = location.longitude
    address = location.address
    unset_proxy()
    print(str(lat) + ', ' + str(lon))
The problem I'm having is that when I run the code the following error is thrown:
GeocoderQueryError: Your request was denied.
I tried creating it without passing the key to the Google API; however, I get the following message:
KeyError: 'http_proxy'
And if I remove the unset_proxy() call from within the for loop, the message I receive is:
GeocoderQuotaExceeded: The given key has gone over the requests limit in the 24 hour period or has submitted too many requests in too short a period of time.
But I only made 5 requests today, and I'm putting a 2-second sleep between requests. Should the period be longer?
Any idea?
The api_key argument of the GoogleV3 class must be a string, not a list of strings (that's the cause of your first issue).
geopy doesn't guarantee that the http_proxy/https_proxy env vars are respected (especially runtime modifications of os.environ). The usage of proxies advised by the docs is:
geolocator = GoogleV3(proxies={'http': proxy_addr, 'https': proxy_addr})
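Putting both points together, a rough sketch (the key and proxy values below are placeholders):

from geopy.geocoders import GoogleV3

proxy_addr = 'http://usuario:senha@IP:PORTA'   # placeholder proxy URL
geolocator = GoogleV3(
    api_key='YOUR-NEW-API-KEY',                # a single string, not a list
    proxies={'http': proxy_addr, 'https': proxy_addr},
)

location = geolocator.geocode('some address', timeout=10)
print(location.latitude, location.longitude)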
PS: Please don't ever post your API keys in public. I suggest revoking the key you've posted in the question and generating a new one, to prevent the possibility of it being abused by someone else.
I am attempting to make a few thousand DNS queries. I have written my script to use python-adns. I have attempted to add threading and queues to ensure the script runs optimally and efficiently.
However, I can only achieve mediocre results. The responses are choppy/intermittent. They start and stop, and most times pause for 10 to 20 seconds.
tlock = threading.Lock()  # printing to screen

def async_dns(i):
    s = adns.init()
    for i in names:
        tlock.acquire()
        q.put(s.synchronous(i, adns.rr.NS)[0])
        response = q.get()
        q.task_done()
        if response == 0:
            dot_net.append("Y")
            print(i + ", is Y")
        elif response == 300:
            dot_net.append("N")
            print(i + ", is N")
        tlock.release()

q = queue.Queue()
threads = []
for i in range(100):
    t = threading.Thread(target=async_dns, args=(i,))
    threads.append(t)
    t.start()

print(threads)
I have spent countless hours on this. I would appreciate some input from experienced Pythonistas. Is this a networking issue? Can these bottlenecks/intermittent responses be solved by switching servers?
Thanks.
Without answers to the questions I asked in comments above, I'm not sure how well I can diagnose the issue you're seeing, but here are some thoughts:
It looks like each thread is processing all names instead of just a portion of them.
Your Queue seems to be doing nothing at all.
Your lock seems to guarantee that you actually only do one query at a time (defeating the purpose of having multiple threads).
Rather than trying to fix up this code, might I suggest using multiprocessing.pool.ThreadPool instead? Below is a full working example. (You could use adns instead of socket if you want... I just couldn't easily get it installed and so stuck with the built-in socket.)
In my testing, I also sometimes see pauses; my assumption is that I'm getting throttled somewhere.
import itertools
from multiprocessing.pool import ThreadPool
import socket
import string

def is_available(host):
    print('Testing {}'.format(host))
    try:
        socket.gethostbyname(host)
        return False
    except socket.gaierror:
        return True

# Test the first 1000 three-letter .com hosts
hosts = [''.join(tla) + '.com' for tla in itertools.permutations(string.ascii_lowercase, 3)][:1000]

with ThreadPool(100) as p:
    results = p.map(is_available, hosts)

for host, available in zip(hosts, results):
    print('{} is {}'.format(host, 'available' if available else 'not available'))