Web scraping: how to make it faster? - python-3.x

I have to extract some attributes (in my example just one: a text description of apps) from web pages. The problem is the time!
With the following code, visiting a page, extracting one part of the HTML, and saving it takes about 1.2-1.8 seconds per page. That's a lot of time. Is there a way to make it faster? I have a lot of pages; there could be as many as 200,000.
I'm using Jupyter.
Description = []
for x in range(len(M)):
    response = http.request('GET', M[x])
    soup = BeautifulSoup(response.data, "lxml")
    t = str(soup.find("div", attrs={"class": "section__description"}))
    Description.append(t)
Thank you

You should consider inspecting the page a bit. If the page relies on a REST API, you could get the content you need directly from the API. That is a much more efficient way than pulling the content out of the HTML.
To consume the API, check out the Requests library for Python.
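For example, if the site exposes a JSON endpoint, the descriptions could be fetched straight from it. A minimal sketch (the endpoint URL and the "description" field are hypothetical placeholders you would replace after inspecting the browser's network tab):
import requests

# Hypothetical API endpoint - inspect the browser's network tab to find the real one.
API_URL = "https://example.com/api/apps/{app_id}"

def fetch_description(app_id):
    """Fetch one app's description directly from the (assumed) JSON API."""
    resp = requests.get(API_URL.format(app_id=app_id), timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # The field name is an assumption; adjust it to the real API response.
    return data.get("description")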

I would try carving this up into multiple processes, per my comment. You can put your code into a function and use multiprocessing like this:
from multiprocessing import Pool

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()  # assuming the http object from the question is a urllib3 PoolManager

def web_scrape(url):
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "lxml")
    t = str(soup.find("div", attrs={"class": "section__description"}))
    return t

if __name__ == '__main__':
    # M is your list of urls
    M = ["https://...", "...", "..."]
    p = Pool(5)  # 5, or however many processes you think is appropriate (start with how many cores you have, maybe)
    description = p.map(web_scrape, M)
    p.close()
    p.join()
    description = list(description)  # p.map already returns a list, so this is optional
What is happening is that your list of urls is getting distributed to multiple processes that run your scrape function. The results all then get consolidated in the end and wind up in description. This should be much faster than if you processed each url one at a time like you are doing currently.
For more details: https://docs.python.org/2/library/multiprocessing.html
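Since the work is mostly waiting on the network, a thread pool is another option worth trying; it avoids process start-up and serialization overhead. A minimal sketch with concurrent.futures, again assuming http is a urllib3 PoolManager as in the question (the URL list and worker count are placeholders):
import concurrent.futures

import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()  # assumption: same kind of http object as in the question

def web_scrape(url):
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "lxml")
    return str(soup.find("div", attrs={"class": "section__description"}))

if __name__ == '__main__':
    M = ["https://example.com/app1", "https://example.com/app2"]  # placeholder list of urls
    # 20 workers is only a starting point; tune it to what the target server tolerates.
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        description = list(executor.map(web_scrape, M))
Threads work well here because each worker spends most of its time blocked on I/O rather than using the CPU.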

Related

Fetching thousands of urls with Newspaper3k and Multiprocessing slows down after a few hundred calls

I have code which is meant to:
a) call an API to get Google SERP results;
b) open each retrieved url with the newspaper3k python3 library, which extracts the text of the news article;
c) save the text of the article into a .txt file.
The implementation of the multiprocessing part is as follows:
def createFile(newspaper_article):
    """ function that opens each article, parses it, and saves it to file on disk"""

def main():
    p = ThreadPool(10)
    p.map(partial(createFile), sourcesList)
    p.close()
    p.join()

if __name__ == '__main__':
    main()
I have also tried with Pool instead of ThreadPool.
The problem is that after fetching and saving a few hundred articles, it slows down dramatically.
Sometimes a link takes a while to load, but I'd expect the other workers to keep going in the meantime.
What am I doing wrong?
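For context, a createFile along the lines described above might look like this (a hypothetical sketch using newspaper3k's Article API; the output directory and file-naming scheme are made up):
import os
from newspaper import Article

OUTPUT_DIR = "articles"  # hypothetical output directory

def createFile(newspaper_article):
    """Open one article URL, parse it, and save its text to a .txt file on disk."""
    try:
        article = Article(newspaper_article)
        article.download()
        article.parse()
        filename = os.path.join(OUTPUT_DIR, str(abs(hash(newspaper_article))) + ".txt")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(article.text)
    except Exception as exc:
        # One bad URL should not take the whole pool down.
        print("Failed on {}: {}".format(newspaper_article, exc))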

Scraping from a website list returns a null result based on XPath

So I'm trying to scrape the job listings off this site: https://www.dsdambuster.com/careers .
I have the following code:
url = "https://www.dsdambuster.com/careers"
page = requests.get(url, verify=False)
tree = html.fromstring(page.content)
path = '/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
jobs = tree.xpath(xpath)
for job in jobs:
Title = (job.text)
print(Title)
Not too sure why it wouldn't work...
I see 2 issues here:
You are using very bad XPath. It is extremely fragile and not reliable.
Instead of
'/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
Please use
'//div[@class="vf-vacancy-title"]'
You are possibly missing a wait / delay.
I'm not familiar with the library you are using here, but with Selenium, which I am familiar with, you need to wait for the elements to be completely loaded before extracting their text contents.
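For illustration, an explicit wait in Selenium might look like this (a sketch; the class name comes from the XPath suggested above, and the driver setup is generic):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.dsdambuster.com/careers")

# Wait up to 15 seconds for the vacancy titles to be present before reading them.
titles = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "vf-vacancy-title"))
)
for title in titles:
    print(title.text)

driver.quit()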

Why does this simple Python 3 XPath web scraping script not work?

I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straightforward, and before doing anything else I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the result is empty. I've looked at the XPath docs and tried various alterations to the XPath, but I get nothing each time.
I don't think I can answer your question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with XPath myself, and wasn't able to get the number selector to work, despite this post. Here are a couple of examples to try:
tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')
or
tree.xpath('//table//tr//td//div//img[@src]')
or
tree.xpath('//img[@src]') # 68 images
The key to this is building up slowly. Find all the images, then find the images wrapped in the tag you are interested in, etc., until you are confident you can find only the images you are interested in.
Note that [@src] allows us to access the source of each image. Using this post we can now download any/all images we want:
import shutil
from lxml import html
import requests

page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)

cool_images = tree.xpath('//a[@target="_blank"]//img[@src]')
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg'  # path on disk
r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
I would highly recommend looking at Beautiful Soup. For me, this has helped my amateur web scraping ventures. Have a look at this post for a relevant starting point.
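For instance, a minimal Beautiful Soup sketch that collects every image source on the same page could look like this:
import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.bvmjets.com/')
soup = BeautifulSoup(page.content, "html.parser")

# Grab the src attribute of every <img> tag on the page.
sources = [img['src'] for img in soup.find_all('img', src=True)]
print(len(sources))
print(sources[:5])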
This may not be the answer you are looking for, but hopefully it is a starting point / of some use to you - best of luck!

audio file isn't being parsed with Google Speech

This question is a followup to a previous question.
The snippet of code below almost works...it runs without error yet gives back a None value for results_list. This means it is accessing the file (I think) but just can't extract anything from it.
I have a file, sample.wav, living publicly here: https://storage.googleapis.com/speech_proj_files/sample.wav
I am trying to access it by specifying source_uri='gs://speech_proj_files/sample.wav'.
I don't understand why this isn't working. I don't think it's a permissions problem; my session is instantiated fine. The code chugs for a second, yet always comes up with no result. How can I debug this? Any advice is much appreciated.
from google.cloud import speech

speech_client = speech.Client()
audio_sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)

results_list = audio_sample.async_recognize(language_code='en-US')
Ah, that's my fault from the last question. That's the async_recognize command, not the sync_recognize command.
That library has three recognize commands. sync_recognize reads the whole file and returns the results. That's probably the one you want. Remove the letter "a" and try again.
Here's an example Python program that does this: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe.py
FYI, here's a summary of the other types:
async_recognize starts a long-running, server-side operation to translate the whole file. You can poll the server with the operation.poll() method to see whether it has finished and, once it is complete, get the results via operation.results.
The third type is streaming_recognize, which sends you results continually as they are processed. This can be useful for long files where you want some results immediately, or if you're continuously uploading live audio.
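For reference, a minimal sync_recognize sketch along those lines, assuming the same older google-cloud-speech Client API used in the question (the result fields are assumed to mirror the transcript/confidence attributes used in the follow-up below):
from google.cloud import speech

speech_client = speech.Client()
audio_sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)

# sync_recognize blocks until the whole file has been transcribed.
results = audio_sample.sync_recognize(language_code='en-US')
for result in results:
    print(result.transcript, result.confidence)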
I finally got something to work:
import time
from google.cloud import speech

speech_client = speech.Client()
sample = speech_client.sample(
    content=None,
    source_uri='gs://speech_proj_files/sample.wav',
    encoding='LINEAR16',
    sample_rate_hertz=44100)

retry_count = 100
operation = sample.async_recognize(language_code='en-US')
while retry_count > 0 and not operation.complete:
    retry_count -= 1
    time.sleep(10)
    operation.poll()  # API call

print(operation.complete)
print(operation.results[0].transcript)
print(operation.results[0].confidence)
Then something like:
for op in operation.results:
    print(op.transcript)

Python Multiprocessing: How can I make my script faster?

Python 3.6
I am writing a script to automate checking that all the links on a website work.
I have a version of it, but it runs slowly because the Python interpreter only runs one request at a time.
I imported Selenium to pull the links down into a list. I started with a list of 41,000 links and, after removing duplicates, I am now down to 7,300 links. I am using the requests module just to check the response code. I know multiprocessing is the answer; I just see a bunch of different methods. Which is the best for my needs?
The only thing I need to keep in mind is that I can't run too many threads at once, so I don't send our web server's request load sky high.
Here is the function, using the Python requests module, that checks the links and that I am trying to speed up:
def check_links(y):
    for ii in y:
        try:
            r = requests.get(ii.get_attribute('href'))
            rc = r.status_code
            print(ii.get_attribute('href'), ' ', rc)
        except Exception as e:
            logf.write(str(e))
        finally:
            pass
If you just need to apply the same function to all the items in a list, use a process pool and map over your inputs. Here is a simple example:
from multiprocessing import pool

def square(x):
    return {x: x**2}

p = pool.Pool()
results = p.imap_unordered(square, range(10))
for r in results:
    print(r)
In the example I use imap_unordered, but also look at map and imap. You should choose the one that matches your needs the best.
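Applied to the link-checking case above, a capped thread pool plus a per-request timeout keeps the load on the web server bounded. A sketch (the URL list and worker count are placeholders to tune):
from multiprocessing.pool import ThreadPool

import requests

def check_link(url):
    """Return the URL and its HTTP status code (or the error message if the request fails)."""
    try:
        r = requests.get(url, timeout=10)
        return url, r.status_code
    except Exception as e:
        return url, str(e)

if __name__ == '__main__':
    urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholder list
    with ThreadPool(8) as p:  # cap the worker count so the server is not flooded
        for url, status in p.imap_unordered(check_link, urls):
            print(url, status)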
