Web scraping simultaneous request - python-3.x

from urllib import request
from bs4 import BeautifulSoup as bs
page = request.urlopen("http://someurl.url").read()
soup = bs(page, "lxml")
Now this process is very slow because it makes one request, parses the data,
goes through the specified steps, and only then goes back to make the next request.
For example:
for links in soup.findAll('a'):
    print(links.get('href'))
Then we go back to making a request because we want to scrape another URL. How does one speed up this process?
Should I store the whole page source in a file and then do the necessary operations (parse it, find the data we need)?
I got this idea from a DoS (Denial of Service) script that
uses socks and threads to make a large number of requests.
Note: this is only an idea.
Is there a more efficient way of doing this?

The most efficient way to do this is most probably asyncio and, at some point, spawning as many Python processes as your CPU has threads.
See the asyncio documentation,
and call your script like this:
for entry in $(seq $(nproc)); do python yourscript.py "$entry" & done; wait
This will lead to a massive speed improvement. To further increase processing speed, you could use a regex parser instead of BeautifulSoup; this gave me a speedup of about 5 times.
You could also use a specialized library for this task, e.g. Scrapy.
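As a rough sketch of the asyncio approach (not the answerer's exact code), the aiohttp library can be used to issue the requests concurrently; the URL list below is a placeholder and aiohttp is assumed to be installed:

import asyncio
import aiohttp
from bs4 import BeautifulSoup as bs

URLS = ["http://someurl.url/page1", "http://someurl.url/page2"]  # placeholder URLs

async def fetch_links(session, url):
    # download one page and extract its links
    async with session.get(url) as response:
        html = await response.text()
    soup = bs(html, "lxml")
    return [a.get("href") for a in soup.find_all("a")]

async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_links(session, url) for url in URLS]
        return await asyncio.gather(*tasks)

all_links = asyncio.run(main())

Because the requests overlap instead of running one after another, the total time is roughly that of the slowest page rather than the sum of all of them.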

Related

Multi-tasking with Multiprocessing or Threading or Asyncio, depending on the Scenario

I have my code working one task at a time, and I want to upgrade it to something fancier: multi-tasking. I am seeking help on what I can use to achieve my goal.
My code runs in this order: parsing multiple pages, parsing multiple posts, parsing multiple images. I tried to handle multiple pages with multiprocessing via pool.map(), but it failed with an error saying daemonic processes can't have child processes. My understanding of this workflow is that parsing pages is fast, while parsing posts and images can take really long.
What if I parse posts and parse images together on a single page; is that allowed?
Which module should I use to do so: threading, multiprocessing, or asyncio? I have read a lot lately, and I am struggling to decide what I should use.
So off the top of my head you can look at two things.
1) asyncio (be careful: this example is not thread safe if combined with threading, specifically the function asyncio.gather)
import asyncio

async def main():
    # method_to_be_called must be an async coroutine function
    tasks = []
    for work in [1, 2, 3, 4, 5]:
        tasks.append(method_to_be_called(work))
    results = await asyncio.gather(*tasks)
2) Asyncio + multiprocessing
https://github.com/jreese/aiomultiprocess
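For option 2, a minimal sketch following the Pool.map pattern shown in the aiomultiprocess documentation could look like the following; fetch_post and the URL list are hypothetical placeholders, so check the library's current docs for the exact API:

import asyncio
from aiohttp import request
from aiomultiprocess import Pool

async def fetch_post(url):
    # hypothetical worker: download one post/page
    async with request("GET", url) as response:
        return await response.text()

async def main():
    urls = ["http://example.com/post/1", "http://example.com/post/2"]
    async with Pool() as pool:
        return await pool.map(fetch_post, urls)

if __name__ == "__main__":
    results = asyncio.run(main())

Each worker process runs its own event loop, so slow CPU-bound parsing in one process does not block the downloads in the others.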

Efficient Way of Getting URL Redirect from Persistent URLs

I have a dataset that, in part, has a URL field indicating the location of a resource. Some URLs are persistent (e.g. handles and DOIs) and thus need to be resolved to their original URL. I am primarily working with Python, and the solution that seems to work so far involves using the Requests HTTP library.
import requests
var_output_url = requests.get("http://hdl.handle.net/10179/619")
var_output_url.url
While this solution works, it is quite slow as I have to loop through ~4,000 files, each with around 2,000 URLs. Is there a more efficient way of resolving the URL redirects?
I tested my current solution on one batch and it took almost 5 minutes; at this rate, it will take me days (around 13) to process all the batches [...] I know it will not necessarily be that long, and I can run them in parallel
Using HEAD instead of GET should give you only the headers and not the resource body, which in your example is an HTML page. If you only need to resolve URL redirections, this means much less time spent on data transfer over the network. Use the parameter allow_redirects=True to follow redirects.
var_output_url = requests.head("http://hdl.handle.net/10179/619", allow_redirects=True)
var_output_url.url
>>> 'https://mro.massey.ac.nz/handle/10179/619'
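Since the work is network-bound, resolving the URLs concurrently with a thread pool can cut the wall-clock time further. A minimal sketch, assuming a requests.Session for connection reuse and a placeholder list of handle URLs:

import requests
from concurrent.futures import ThreadPoolExecutor

urls = ["http://hdl.handle.net/10179/619"]  # placeholder: one batch of persistent URLs
session = requests.Session()  # reuses TCP connections between requests

def resolve(url):
    # follow redirects with HEAD and return the final URL
    return session.head(url, allow_redirects=True, timeout=10).url

with ThreadPoolExecutor(max_workers=20) as executor:
    resolved = list(executor.map(resolve, urls))

The max_workers value is a guess; raise or lower it depending on how much load the handle/DOI resolvers tolerate.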

Why is urllib.request so slow?

When I use urllib.request and .read().decode() to get a Python dictionary from JSON format, it takes far too long.
Is there any way that I can only get some of the data, for example get the data from one of the keys of the JSON dictionary rather than all of them?
Alternatively, if there was any faster way to get the data that could work as well?
Or is it simply a problem with the connection and cannot be helped?
Also, is the problem with urllib.request.urlopen, with json.loads, or with .read().decode()?
The main symptom of the problem is that it takes roughly 5 seconds to receive information that is not even that much (less than one page of unformatted dictionary). The other symptom is that as I try to receive more and more information, there is a point at which I simply receive no response from the webpage at all!
The 2 lines which take up the most time are:
response = urllib.request.urlopen(url) # url is a string with the url
data = json.loads(response.read().decode())
For some context on what this is part of, I am using the Edamam Recipe API.
Help would be appreciated.
Is there any way that I can only get some of the data, for example get the data from one of the keys of the JSON dictionary rather than all of them?
You could try with a streaming json parser, but I don't think you're going to get any speedup from this.
Alternatively, if there was any faster way to get the data that could work as well?
If you have to retrieve a json document from an url and parse the json content, I fail to imagine what could be faster than sending an http request, reading the response content and parsing it.
Or is it simply a problem with the connection and cannot be helped?
Given the figures you mention, the issue is almost certainly in the networking part, which means anything between your Python process and the server's process. Note that this includes your whole system (proxy/firewall, your network card, your OS TCP/IP stack, etc., and possibly some antivirus on Windows), your network itself, and of course the end server, which may be slow or a bit overloaded at times, or just deliberately throttling your requests to avoid overload.
Also is the problem with the urllib.request.urlopen or is it with the json.loads or with the .read().decode().
How can we know without timing it on your own machine? But you can easily check this yourself: just time the execution of the various parts and log the results.
The other symptom is that as I try to receive more and more information, there is a point when I simply receive no response from the webpage at all!
Cf. above: if you're sending hundreds of requests in a row, the server might either throttle your requests to avoid overload (most API endpoints behave that way) or just plain be overloaded. Do you at least check the HTTP response status code? You may get 503 (server overloaded) or 429 (too many requests) responses.
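For example, a minimal timing sketch along those lines (the URL is a placeholder) separates the network time from the parsing time and surfaces throttling responses:

import json
import time
import urllib.request
from urllib.error import HTTPError

url = "https://api.example.com/recipes"  # placeholder endpoint

try:
    t0 = time.perf_counter()
    response = urllib.request.urlopen(url)
    t1 = time.perf_counter()
    raw = response.read().decode()
    t2 = time.perf_counter()
    data = json.loads(raw)
    t3 = time.perf_counter()
    print("urlopen:", t1 - t0, "s")      # connection + headers
    print("read/decode:", t2 - t1, "s")  # body transfer
    print("json.loads:", t3 - t2, "s")   # parsing
except HTTPError as e:
    # urlopen raises for 4xx/5xx; 429 or 503 suggests throttling or overload
    print("HTTP error:", e.code)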

How to kill a program if it runs too long

I am writing a program for school, and I have come across a problem. It is a web crawler, and sometimes it can get stuck on a URL for over 24 hours. I was wondering if there was a way to continue the while loop and go to the next URL if one takes over a set amount of time. Thanks
If you are using urllib in Python 3, you can use the timeout argument: urllib.request.urlopen(url, timeout=<t in secs>). That parameter is used for all blocking operations used internally by urllib.
But if you are using a different library, consult its documentation or mention it in the question.
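A minimal sketch of how that looks in a crawl loop (the URL list is a placeholder): the timeout surfaces as an exception, and catching it lets the loop skip to the next URL.

import socket
import urllib.request
from urllib.error import URLError

urls = ["http://example.com/a", "http://example.com/b"]  # placeholder crawl queue

for url in urls:
    try:
        # any blocking socket operation inside urlopen/read aborts after 30 seconds
        page = urllib.request.urlopen(url, timeout=30).read()
    except (URLError, socket.timeout) as exc:
        print("skipping", url, "->", exc)
        continue
    # ... parse page and collect further URLs here ...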

Running a method in the background of the Python interpreter

I've made a class in Python 3.x that acts as a server. One method manages sending and receiving data via UDP/IP using the socket module (the data is stored in self.cmd and self.msr, respectively). I want to be able to modify the self.msr and self.cmd variables from within the Python interpreter while the server is running. For example:
>>> from myserver import MyServer
>>> s = MyServer()
>>> s.background_recv_send() # runs in the background, constantly calling s.recv_msr(), s.send_cmd()
>>> process_data(s.msr) # I use the latest received data
>>> s.cmd[0] = 5 # this will be sent automatically
>>> s.msr # I can see what the newest data is
So far, s.background_recv_send() does not exist. I need to manually call s.recv_msr() each time I want to update the value of s.msr (s.recv_msr uses a blocking socket), and then call s.send_cmd() to send s.cmd.
In this particular case, which module makes more sense: multiprocess or threading?
Any hints how could I best solve this? I have no experience with either processes or threads (just read a lot, but I am still unsure which way to go).
In this case, threading makes the most sense. In short, multiprocessing is for running work on different processors; threading is for doing things in the background.
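A minimal sketch of what such a background method could look like, assuming a MyServer class with recv_msr() and send_cmd() as in the question (the method bodies below are placeholders for the real socket code):

import threading
import time

class MyServer:
    def __init__(self):
        self.cmd = [0]
        self.msr = None
        self._running = False

    def recv_msr(self):
        # placeholder for the blocking UDP receive that updates self.msr
        time.sleep(0.1)
        self.msr = "latest measurement"

    def send_cmd(self):
        # placeholder for the UDP send of self.cmd
        pass

    def background_recv_send(self):
        # run the recv/send loop in a daemon thread so the interpreter prompt stays free
        self._running = True
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while self._running:
            self.recv_msr()
            self.send_cmd()

Because the thread is marked as a daemon, it will not keep the interpreter alive when you exit; setting self._running to False stops the loop cleanly.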
