Currently, I'm trying to improve some code that sends repeated HTTP requests to a webpage until it can capture some text (which the code locates through a known pattern) or until 180 seconds run out (the time we wait for the page to give us an answer).
This is the part of the code (a little edited for privacy purposes):
if matches is None:
    txt = "No answer til now"
    print(txt)
    Solution = False
    start = time.time()
    interval = 0
    while interval < 180:
        response = requests.get("page address")
        subject = response.text
        matches = re.search(pattern, subject, re.IGNORECASE)
        if matches is not None:
            Solution = matches.group(1)
            elapsed = "{:.2f}".format(time.time() - start)
            txt = "Found an answer " + Solution + ", time needed: " + elapsed
            print(txt)
            break
        interval = time.time() - start
else:
    Solution = matches.group(1)
It runs OK, but I was told that doing "infinite requests in a loop" could cause high CPU usage on the server. Do you know of something I can use in order to avoid that?
PS: I heard that in PHP people use curl_multi_select() for things like this. I don't know if that's correct, though.
Usually an HTTP REST API will specify in the documentation how many requests you can make in a given time period against which endpoint resources.
For a website, if you are not hitting a request limit and getting flagged/banned for too many requests, then you should be okay to continuously loop like this, but you may want to introduce a time.sleep call into your while loop.
An alternative to the 180 second timeout:
Since HTTP requests are I/O operations and can take a variable amount of time, you may want to change your exit case for the loop to a certain number of requests (like 25 or so) and then incorporate the aforementioned sleep call.
That could look like:
# ...
if matches is None:
    solution = None
    num_requests = 25
    start = time.time()
    while num_requests:
        response = requests.get("page address")
        if response.ok:  # It's good to attempt to handle potential HTTP/connectivity errors
            subject = response.text
            matches = re.search(pattern, subject, re.IGNORECASE)
            if matches:
                solution = matches.group(1)
                elapsed = "{:.2f}".format(time.time() - start)
                txt = "Found an answer " + solution + ", time needed: " + elapsed
                print(txt)
                break
        else:
            # Maybe raise an error here?
            pass
        time.sleep(2)
        num_requests -= 1
else:
    solution = matches.group(1)
Notes:
Regarding PHP's curl_multi_select - (NOT a PHP expert here...) it seems that this function is designed to allow you to watch multiple connections to different URLs in an asynchronous manner. Async doesn't really apply to your use case here because you are only scraping one webpage (URL), and are just waiting for some data to appear there.
If the response.text you are searching through is HTML and you aren't already parsing it somewhere else in your code, I would recommend Beautiful Soup or Scrapy (before regex) for searching for string patterns in webpage markup.
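For instance, a minimal sketch of extracting the page's visible text with Beautiful Soup before applying your regex might look like this (the URL and pattern below are placeholders):

import re

import requests
from bs4 import BeautifulSoup

# Placeholder values - substitute your real page address and pattern
url = "page address"
pattern = r"answer:\s*(\w+)"

response = requests.get(url)
if response.ok:
    soup = BeautifulSoup(response.text, "html.parser")
    # get_text() strips the markup so the regex only sees readable text
    page_text = soup.get_text(separator=" ", strip=True)
    matches = re.search(pattern, page_text, re.IGNORECASE)
    if matches:
        print("Found an answer:", matches.group(1))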
Related
I'm learning web scraping with Python, and as an all-in-one exercise I'm trying to make a game catalog using the Beautiful Soup and requests modules as my main tools. The problem, though, lies in handling the parts of the code related to the requests module.
DESCRIPTION:
The exercise is about getting all the genre tags used for classifying games starting with the letter A, beginning with the first page. Each page shows around (or exactly) 30 games, so if one wants to access a specific page of a letter, one has to use a URL in this form:
https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/1
https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/2
https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/3
And so on...
As a matter of fact, each alphabet letter main page has the form:
URL: https://vandal.elespanol.com/juegos/13/pc/letra/ which is equivalent to https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/.
Scraping genres from a few pages is not a big deal, but what if I want to scrape all of them for a letter? How do I know when I'm done scraping genres from all the games of a letter?
When you request the URL https://vandal.elespanol.com/juegos/13/pc/letra/a/inicio/200, for example, you get redirected to the corresponding letter's main page, i.e. the first 30 games, since in the end there are no more games to return.
So, bearing that in mind, I was thinking about checking the status_code of the requests.get() response, but I get 200 as the status code, whereas when analyzing the responses received with Chrome DevTools I see 301 as the status code. At the end of the program I save the scraped genres to a file.
And here's the code:
from bs4 import BeautifulSoup
import string
import requests
from string import ascii_lowercase


def write_genres_to_file(site_genres):
    with open('/home/l0new0lf/Desktop/generos.txt', 'w') as file_:
        print(f'File "{file_.name}" OPENED to write {len(site_genres)} GENRES')
        counter = 1
        site_genres_length = len(site_genres)
        for num in range(site_genres_length):
            print('inside File Loop')
            if counter != 2:
                if counter == 3:
                    file_.write(f'{site_genres[num]}' + '\n')
                    print('wrote something')
                    counter = 0
                else:
                    file_.write(f'{site_genres[num]}')
            else:
                file_.write(f'{site_genres[num]:^{len(site_genres[num])+8}}')
            print(f'Wrote genre "{site_genres[num]}" SUCCESSFULLY!')
            counter += 1


def get_tags():
    #TITLE_TAG_SELECTOR = 'tr:first-child td.ta14b.t11 div a strong'
    #IMG_TAG_SELECTOR = 'tr:last-child td:first-child a img'
    #DESCRIPTION_TAG_SELECTOR = 'tr:last-child td:last-child p'
    GENRES_TAG_SELECTOR = 'tr:last-child td:last-child div.mt05 p'
    GAME_SEARCH_RESULTS_TABLE_SELECTOR = 'table.mt1.tablestriped4.froboto_real.blanca'
    GAME_TABLES_CLASS = 'table transparente tablasinbordes'

    site_genres = []
    for i in ['a']:
        counter = 1
        while True:
            rq = requests.get(f'https://vandal.elespanol.com/juegos/13/pc/letra/{i}/inicio/{counter}')
            if rq:
                print('Request GET: from ' + f'https://vandal.elespanol.com/juegos/13/pc/letra/{i}/inicio/{counter}' + ' Got Workable Code !')
            if rq.status_code == 301 or rq.status_code == 302 or rq.status_code == 303 or rq.status_code == 304:
                print(f'No more games in letter {i}\n**REDIRECTING TO **')
                break
            counter += 1
            soup = BeautifulSoup(rq.content, 'lxml')
            main_table = soup.select_one(GAME_SEARCH_RESULTS_TABLE_SELECTOR)
            #print('This is the MAIN TABLE:\n' + str(main_table))
            game_tables = main_table.find_all('table', {'class': GAME_TABLES_CLASS})
            #print('These are the GAME TABLES:\n' + str(game_tables))
            for game in game_tables:
                genres_str = str(game.select_one(GENRES_TAG_SELECTOR).contents[1]).strip().split(' / ')
                for genre in genres_str:
                    if not genre in site_genres:
                        site_genres.append(genre)
    write_genres_to_file(site_genres)


get_tags()
So, roughly, my question is: how could I know when I'm done scraping all games starting with a certain letter, so that I can start scraping the games of the next one?
NOTE: I could only think of checking, on every loop iteration, whether the returned HTML structure is the same as that of the first page of the letter, or maybe checking whether I'm receiving repeated games. But I don't think that's the way I should go about it.
Any help is truly welcome, and I'm very sorry for the very long problem description, but I thought it was necessary.
I simply would not rely on a status code alone. You might get a non-200 status even for pages that are there, for example if you exceed a certain limit described in their robots.txt, or if your network has a delay or error.
So, to answer your question "How do I ensure that I scraped all pages corresponding to a certain letter?": you may save the "visible text" of each page, as in this reply (BeautifulSoup Grab Visible Webpage Text), and hash its content. When you hit the same hash again, you know that you already crawled/scraped that page, and you can then move on to the next letter.
As an example hash snippet, I would use the following:
import urllib.request

from bs4 import BeautifulSoup


def from_text_to_hash(url: str) -> str:
    """Get the visible text of a page and hash it."""
    url_downloaded = urllib.request.urlopen(url)
    soup = BeautifulSoup(url_downloaded, "lxml")
    visible_text = soup.title.text + "\t" + soup.body.text
    current_hash = str(hash(visible_text))
    return current_hash
And you keep track of the current_hash values in a set.
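Put together, a minimal sketch of the idea (assuming the from_text_to_hash function above and the page URL pattern from your question) could look like this:

seen_hashes = set()

for letter in ['a']:
    page = 1
    while True:
        url = f'https://vandal.elespanol.com/juegos/13/pc/letra/{letter}/inicio/{page}'
        page_hash = from_text_to_hash(url)
        if page_hash in seen_hashes:
            # Redirected back to a page we already saw: this letter is done
            break
        seen_hashes.add(page_hash)
        # ... scrape the genres from this page here ...
        page += 1
    seen_hashes.clear()  # start fresh for the next letter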
I need to make an unlimited number of HTTP requests to a web API, one after another, and I need this to work efficiently and reasonably fast. (I need it for a utility, so it should work no matter how many times I use it; it should also be usable on a web server where several people use it at the same time.)
Right now I'm using threading with a queue, but after a while I'm getting errors like:
'cant start a new thread'
'MemoryError'
Or it may work for a bit, but pretty slowly.
this is a part of my code:
concurrent = 25
q = Queue(concurrent * 2)

for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()

for url in get_urls():
    q.put(url.strip())

q.join()
*get_urls() is a simple function that returns a list of URLs (unknown length).
This is my receiveJson (thread target):
def receiveJson():
    while True:
        url = q.get()
        res = request.get(url).json()
        q.task_done()
The problem comes from your threads never ending; notice that there is no exit condition in your receiveJson function. The simplest way to signal that it should end is usually by enqueuing None:
def receiveJson():
    while True:
        url = q.get()
        if url is None:  # Exit condition allows thread to complete
            q.task_done()
            break
        res = request.get(url).json()
        q.task_done()
and then you can change the other code as follows:
concurrent = 25
q = Queue(concurrent * 2)

for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()

for url in get_urls():
    q.put(url.strip())

for i in range(concurrent):
    q.put(None)  # Add a None for each thread to be able to get and complete

q.join()
There are other ways of doing this, but this is how to do it with the least amount of change to your code. If this is happening often, it might be worth looking into the concurrent.futures.ThreadPoolExecutor class to avoid the cost of creating threads over and over.
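As a rough illustration of that alternative (a minimal sketch, assuming the same get_urls() function and the requests library), it could look like:

from concurrent.futures import ThreadPoolExecutor

import requests


def fetch_json(url):
    # Each worker thread fetches one URL and returns the parsed JSON
    return requests.get(url).json()


urls = [url.strip() for url in get_urls()]

# The pool reuses a fixed number of threads instead of creating new ones per batch
with ThreadPoolExecutor(max_workers=25) as executor:
    results = list(executor.map(fetch_json, urls))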
I am attempting to make a few thousand DNS queries. I have written my script to use python-adns. I have attempted to add threading and queues to ensure the script runs optimally and efficiently.
However, I can only achieve mediocre results. The responses are choppy/intermittent. They start and stop, and most times pause for 10 to 20 seconds.
tlock = threading.Lock()  # printing to screen

def async_dns(i):
    s = adns.init()
    for i in names:
        tlock.acquire()
        q.put(s.synchronous(i, adns.rr.NS)[0])
        response = q.get()
        q.task_done()
        if response == 0:
            dot_net.append("Y")
            print(i + ", is Y")
        elif response == 300:
            dot_net.append("N")
            print(i + ", is N")
        tlock.release()
q = queue.Queue()
threads = []

for i in range(100):
    t = threading.Thread(target=async_dns, args=(i,))
    threads.append(t)
    t.start()

print(threads)
I have spent countless hours on this. I would appreciate some input from experienced Pythonistas. Is this a networking issue? Can this bottleneck/these intermittent responses be solved by switching servers?
Thanks.
Without answers to the questions I asked in comments above, I'm not sure how well I can diagnose the issue you're seeing, but here are some thoughts:
It looks like each thread is processing all names instead of just a portion of them.
Your Queue seems to be doing nothing at all.
Your lock seems to guarantee that you actually only do one query at a time (defeating the purpose of having multiple threads).
Rather than trying to fix up this code, might I suggest using multiprocessing.pool.ThreadPool instead? Below is a full working example. (You could use adns instead of socket if you want... I just couldn't easily get it installed and so stuck with the built-in socket.)
In my testing, I also sometimes see pauses; my assumption is that I'm getting throttled somewhere.
import itertools
from multiprocessing.pool import ThreadPool
import socket
import string


def is_available(host):
    print('Testing {}'.format(host))
    try:
        socket.gethostbyname(host)
        return False
    except socket.gaierror:
        return True


# Test the first 1000 three-letter .com hosts
hosts = [''.join(tla) + '.com' for tla in itertools.permutations(string.ascii_lowercase, 3)][:1000]

with ThreadPool(100) as p:
    results = p.map(is_available, hosts)

for host, available in zip(hosts, results):
    print('{} is {}'.format(host, 'available' if available else 'not available'))
As far as I can tell, my code works absolutely fine, though it probably looks a bit rudimentary and crude to more experienced eyes.
Objective:
Create a 'filter' that loops through a (large) range of possible ID numbers. Each ID should be tried to log in at the url website. If the ID is valid, it should be saved to hit_list.
Issue:
In large loops, the programme 'hangs' for indefinite periods of time. Although I have no evidence (no exception is thrown), I suspect this is a timeout issue (or rather, it would be if a timeout were specified).
Question:
I want to add a timeout and then handle the timeout exception so that my programme will stop hanging. If this theory is wrong, I would also like to hear what my issue might be.
How to add a timeout is a question that has been asked before: here and here, but after spending all weekend working on this, I'm still at a loss. Put bluntly, I don't understand those answers.
What I've tried:
I created a try & except block in the id_filter function: the try is at r = s.get(url) and the except is at the end of the function. I've read the requests docs in detail, here and here. This didn't work.
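To illustrate, the rough shape of that attempt was something like this (a simplified sketch, not my exact code; the timeout value here is arbitrary):

import requests

url = 'https://ndber.seai.ie/pass/ber/search.aspx'

def id_filter(_range):
    with requests.session() as s:
        try:
            r = s.get(url, timeout=10)  # arbitrary timeout in seconds
            # ... build the form data and post each ID, as in the full code below ...
        except requests.exceptions.Timeout:
            print('request timed out')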
The more I read about futures, the more I'm convinced that handling errors has to be done in futures rather than in requests (as I did above). So I tried inserting a timeout in the brackets after boss.map, but as far as I could tell this had no effect; it seems too simple anyway.
So, to reiterate:
For large loops (50,000+) my programme tends to hang for an indefinite period of time (there is no exact point when this starts, though it's usually after 90% of the loop has been processed). I don't know why, but I suspect adding a timeout would throw an exception, which I can then catch. This theory may, however, be wrong. I have tried to add a timeout and handle other errors in the requests part, but to no effect.
-Python 3.5
My code:
import concurrent.futures as cf
import time

import requests
from bs4 import BeautifulSoup

hit_list = []
processed_list = []
startrange = 100050000
end_range = 100150000
loop_size = range(startrange, end_range)
workers = 70
chunks = 300
url = 'https://ndber.seai.ie/pass/ber/search.aspx'


def id_filter(_range):
    with requests.session() as s:
        s.headers.update({
            'user-agent': 'FOR MORE INFORMATION ABOUT THIS DATA COLLECTION PLEASE CONTACT ########'
        })
        r = s.get(url)
        time.sleep(.1)
        soup = BeautifulSoup(r.content, 'html.parser')
        viewstate = soup.find('input', {'name': '__VIEWSTATE'}).get('value')
        viewstategen = soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value')
        validation = soup.find('input', {'name': '__EVENTVALIDATION'}).get('value')
        for ber in _range:
            data = {
                'ctl00$DefaultContent$BERSearch$dfSearch$txtBERNumber': ber,
                'ctl00$DefaultContent$BERSearch$dfSearch$Bottomsearch': 'Search',
                '__VIEWSTATE': viewstate,
                '__VIEWSTATEGENERATOR': viewstategen,
                '__EVENTVALIDATION': validation,
            }
            y = s.post(url, data=data)
            if 'No results found' in y.text:
                pass  # print('Invalid ID', ber)
            else:
                hit_list.append(ber)
                print('Valid ID', ber)


if __name__ == '__main__':
    with cf.ThreadPoolExecutor(max_workers=workers) as boss:
        jobs = [loop_size[x: x + chunks] for x in range(0, len(loop_size), chunks)]
        boss.map(id_filter, jobs)
        # record data below
I'm implementing a crawler for a website with a growing number of entities. There is no information available on how many entities exist, and there is no list of all entities. Every entity can be accessed with a URL like this: http://www.somewebsite.com/entity_{i}, where {i} is the number of the entity, starting at 1 and incrementing by 1.
To crawl every entity I'm running a loop which checks whether an HTTP request returns a 200 or a 404. If I get a 404 NOT FOUND, the loop stops and I know I have all entities.
The serial way looks like this:
def atTheEnd = false
def i = 0
while (!atTheEnd) {
    atTheEnd = !crawleWebsite("http://www.somewebsite.com/entity_" + i)
    i++
}
crawleWebsite() returns true if it succeeds and false if it got a 404 NOT FOUND error.
The problem is that crawling those entities can take very long, which is why I want to do it in multiple threads; but I don't know the total number of entities, so the tasks aren't independent of each other.
What's the best way to solve this problem?
My approach would be this: use binary search with REST HEAD requests to get the total number of entities (somewhere between 500 and 1000) and split those across some threads.
Is there maybe a better way of doing this?
tl;dr
Basically I want to tell a thread pool to programmatically create new tasks until a condition is satisfied (when the first 404 occurs) and to wait until every task has finished.
Note: I'm implementing this code using Grails 3.
As you said, the total number of entities is not known and can go into thousands. In this case I would simply go for a fixed thread pool and speculatively query URLs even though you may have already reached the end. Consider this example.
@Grab(group = 'org.codehaus.gpars', module = 'gpars', version = '1.2.1')

import groovyx.gpars.GParsPool

// crawling simulation - ignore :-)
def crawleWebsite(url) {
    println "$url:${Thread.currentThread().name}"
    Thread.sleep(1)
    Math.random() * 1000 < 950
}

final Integer step = 50

Boolean atTheEnd = false
Integer i = 0
while (true) {
    GParsPool.withPool(step) {
        (i..(i + step)).eachParallel { atTheEnd = atTheEnd || !crawleWebsite("http://www.somewebsite.com/entity_" + it) }
    }
    if (atTheEnd) {
        break
    }
    i += step
}
The thread pool is set to 50, and once all 50 URLs are crawled we check whether we have reached the end. If not, we carry on.
Obviously in the worst-case scenario you can crawl 50 404s. But I'm sure you could get away with it :-)