Scraping from website list returns a null result based on XPath - python-3.x

So I'm trying to scrape the job listings off this site: https://www.dsdambuster.com/careers
I have the following code:
url = "https://www.dsdambuster.com/careers"
page = requests.get(url, verify=False)
tree = html.fromstring(page.content)
path = '/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
jobs = tree.xpath(xpath)
for job in jobs:
Title = (job.text)
print(Title)
Not too sure why it wouldn't work...

I see 2 issues here:
You are using a very fragile XPath: a long absolute path with positional indices is not reliable.
Instead of
'/html/body/div[1]/section/div/div/div[2]/div[1]/div/div[2]/div/div[9]/div[1]/div[3]/div[*]/div[1]/a[*]/div/div[1]/div'
Please use
'//div[@class="vf-vacancy-title"]'
You are possibly missing a wait / delay.
I'm not familiar with the approach you are using here, but with Selenium, which I am familiar with, you would need to wait for the elements to finish loading before extracting their text content.
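As a rough sketch (not the poster's original code), a Selenium explicit wait could look like this, reusing the vf-vacancy-title class from the XPath suggested above; the driver setup and the 15-second timeout are assumptions.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # assumes a local Chrome / chromedriver setup
driver.get("https://www.dsdambuster.com/careers")
# wait up to 15 seconds for at least one job title element to be present in the DOM
titles = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.XPATH, '//div[@class="vf-vacancy-title"]'))
)
for title in titles:
    print(title.text)
driver.quit()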

Related

How to get the right video URL of an Instagram post using Python

I am trying to build a program with a function that takes the URL of a post as input and outputs the links of the images and videos the post contains. It works really well for images. However, when it comes to getting the links of videos, it returns the wrong URL. I have no idea how to handle this situation.
https://scontent-lax3-2.cdninstagram.com/v/t50.2886-16/86731551_2762014420555254_3542675879337307555_n.mp4?efg=eyJ2ZW5jb2RlX3RhZyI6InZ0c192b2RfdXJsZ2VuLjcyMC5jYXJvdXNlbF9pdGVtIiwicWVfZ3JvdXBzIjoiW1wiaWdfd2ViX2RlbGl2ZXJ5X3Z0c19vdGZcIl0ifQ&_nc_ht=scontent-lax3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=WDuXskvIuLEAX9rj7MU&vs=17877888256532912_3147883953&_nc_vs=HBksFQAYJEdCOXJLd1gyMVdhWUNkQUpBS090UGo3eEhDb3hia1lMQUFBRhUAAsgBABUAGCRHTFBXTUFVTXNPaG5XcW9EQU5KUEE5bEZVdVZxYmtZTEFBQUYVAgLIAQAoABgAGwGIB3VzZV9vaWwBMBUAABgAFuD4nJGH9sE%2FFQIoAkMzLBdAEszMzMzMzRgSZGFzaF9iYXNlbGluZV8xX3YxEQB17gcA&_nc_rid=97e769e058&oe=5EDF10A5&oh=3713c35f89fca1aa9554a281aa3421ed
https://scontent-gmp1-1.cdninstagram.com/v/t50.2886-16/0_0_0_\x00.mp4?_nc_ht=scontent-gmp1-1.cdninstagram.com&_nc_cat=100&_nc_ohc=Wnu_-GvKHJoAX9F_ui1&oe=5EDE8214&oh=7920ac3339d5bf313e898c3cbec85fa2
Here are two URLs. The first one is copied from the source of the web page, while the second one is copied from the data scraped by pyquery. They come from the same Instagram post, same path, but they are totally different. The first one works, but the second one doesn't. How can I solve this and get the right video URL?
I have been searching the net for a long time, but with no luck. Please help or give some ideas on how to achieve this.
Here is my code related to the question
import json
from pyquery import PyQuery as pq

def getUrls(url):
    URL = str(url)
    html = get_html(URL)  # helper defined elsewhere that fetches the page HTML
    doc = pq(html)
    urls = []
    items = doc('script[type="text/javascript"]').items()
    for item in items:
        if item.text().strip().startswith('window._sharedData'):
            js_data = json.loads(item.text()[21:-1])
            shortcode_media = js_data["entry_data"]["PostPage"][0]["graphql"]["shortcode_media"]
            edges = shortcode_media['edge_sidecar_to_children']['edges']
            for edge in edges:
                is_video = edge['node']['is_video']
                if is_video:
                    video_url = edge['node']['video_url']
                    video_url = video_url.replace(r'\u0026', "&")
                    urls.append(video_url)
                else:
                    display_url = edge['node']['display_resources'][-1]['src']
                    display_url = display_url.replace(r'\u0026', "&")
                    urls.append(display_url)
    return urls
Thanks in advance.
There's nothing wrong with your code. This is a known intermittent issue with Instagram, and other people have encountered it too: https://github.com/arc298/instagram-scraper/issues/545
There doesn't appear to be a known fix yet.
Also, while unrelated to your question, it's worth mentioning that you don't need to inspect the display_resources object to get the URL of the image:
display_url = edge['node']['display_resources'][-1]['src']
There is already a display_url property available (I'm guessing you saw this, based on the variable name?). So you can simply do:
display_url = edge['node']['display_url']
I've seen this sometimes when using this Python module instead of HTML-scraping. At least with that module, edge["node"]["videos"]["standard_resolution"]["url"] usually (but not always) gives a non-corrupted value.

Download data from xml using xpath - returns empty list

I am fairly new to using Python to collect data from the web. I am interested in writing a script that collects data from an XML webpage. Here is the address:
https://www.w3schools.com/xml/guestbook.asp
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("/guestbook/guest/fname")
print(guest)
I am not certain why this is returning an empty list. I've tried numerous variations of the XPath expression, so I'm losing confidence that my overall structure is correct.
For context, I want to write something that will parse the entire XML webpage and return a CSV that can be used within other programs. I'm starting with the basics to make sure I understand how the various packages work. Thank you for any help.
This should do it
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("//guestbook/guest/fname")
for i in guest:
    print(i.text)
In the XPath, you need a double slash at the beginning. Also, this returns a list of elements; the text of each element can be extracted using .text.
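Since you mentioned wanting a CSV in the end, here is a minimal sketch of how the loop above could feed Python's csv module; it assumes each guest element also contains an lname child, which you should verify against the page.
import csv
import requests
from lxml import html

url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)

with open("guestbook.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["fname", "lname"])
    for guest in extractedHtml.xpath("//guestbook/guest"):
        # findtext returns the text of the named child element (or None if it is missing)
        writer.writerow([guest.findtext("fname"), guest.findtext("lname")])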

Web-scraping with recursion - where to put return function

I am trying to scrape a website for text. Each page contains a link to the next page, i.e. the first page has the link "/chapter1/page2.html", which in turn links to "/chapter1/page3.html"; the last page has no link. I am trying to write a program that accesses a URL, prints the text of the page, searches through the text to find the URL of the next page, and loops until the very last page, which has no URL. I try to use if, else and return, but I do not understand where I need to put them.
import re
import requests

def scrapy(url):
    result = requests.get(url, timeout=30.0)
    result.encoding = 'cp1251'
    page = result.text
    link = re.findall(r"\bh(.+?)html", page)  # finds link to next page between tags
    print("Continue parsing next page!")
    url = link
    print(url)
    return(url)

url = "http://mywebpage.com/chapter1/page1.html"
result = requests.get(url, timeout=30.0)
result.encoding = 'cp1251'
page = result.text
link = re.findall(r"\bh(.+?)html", page)
if link == -1:
    print("No url!")
else:
    scrapy(url)
Unfortunately it doesn't work; it makes only one loop. Could you kindly tell me what I am doing wrong?
A couple of things: To be recursive, scrapy needs to call itself. Second, recursive functions need branching logic for a base case and a recursive case. In other words, you need part of your function to look like this (pseudocode):
if allDone
return
else
recursiveFunction(argument)
For scrapy, you'd want this branching logic below the line where you find the link (the one where you call re.findall). If you don't find a link, then scrapy is done. If you find a link, then you call scrapy again, passing your newly found link. Your scrapy function will probably need a few more small fixes, but hopefully that will get you unstuck with recursion.
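As a rough sketch only (your function will still need those small fixes), the branching could look like the following; rebuilding the next URL from the regex's captured group is an assumption about how your pattern is meant to be used.
import re
import requests

def scrapy(url):
    result = requests.get(url, timeout=30.0)
    result.encoding = 'cp1251'
    page = result.text
    print(page)  # print the text of this page
    link = re.findall(r"\bh(.+?)html", page)  # look for a link to the next page
    if not link:
        # base case: no link found, this is the last page
        print("No url!")
        return
    # recursive case: rebuild the full match from the captured group and follow it
    next_url = "h" + link[0] + "html"
    print("Continue parsing next page!", next_url)
    scrapy(next_url)

scrapy("http://mywebpage.com/chapter1/page1.html")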
If you want to get really good at thinking in terms of recursion, this book is a good one: https://www.amazon.com/Little-Schemer-Daniel-P-Friedman/dp/0262560992

Web-scraping. How to make it faster?

I have to extract some attributes (in my example there is only one: a text description of apps) from webpages. The problem is the time!
Using the following code, going to a page, extracting one part of the HTML and saving it takes about 1.2-1.8 sec per page. That is a lot of time. Is there a way to make it faster? I have a lot of pages; x could be as high as 200000.
I'm using Jupyter.
Description = []
for x in range(len(M)):
    response = http.request('GET', M[x])
    soup = BeautifulSoup(response.data, "lxml")
    t = str(soup.find("div", attrs={"class": "section__description"}))
    Description.append(t)
Thank you
You should consider inspecting the page a bit. If the page relies on a REST API, you could get the contents you need directly from the API, which is much more efficient than extracting them from the HTML.
To consume it, you should check out the Requests library for Python.
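As a purely hypothetical sketch (the endpoint and field names below are invented; find the real ones in your browser's network tab), fetching such an API directly could look like this.
import requests

# invented endpoint for illustration only; locate the real one in the network tab
resp = requests.get("https://example.com/api/apps/12345")
resp.raise_for_status()
data = resp.json()
description = data.get("description", "")
print(description)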
I would try carving this up into multiple processes per my comment. So you can put your code into a function and use multiprocessing like this
from multiprocessing import Pool

def web_scrape(url):
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "lxml")
    t = str(soup.find("div", attrs={"class": "section__description"}))
    return t

if __name__ == '__main__':
    # M is your list of urls
    M=["https:// ... , ... , ... ]
    p = Pool(5)  # 5 or however many processes you think is appropriate (start with how many cores you have, maybe)
    description = p.map(web_scrape, M)
    p.close()
    p.join()
    description = list(description)  # if you need it to be a list
What is happening is that your list of urls is getting distributed to multiple processes that run your scrape function. The results all then get consolidated in the end and wind up in description. This should be much faster than if you processed each url one at a time like you are doing currently.
For more details: https://docs.python.org/2/library/multiprocessing.html
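A related sketch, not part of the answer above: because the work is I/O-bound, a thread pool often gives a comparable speed-up and tends to be simpler to run from a Jupyter notebook. The requests-based fetch here replaces the http.request call and is an assumption; M is still your list of URLs.
from concurrent.futures import ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

def web_scrape(url):
    # fetch the page and pull out the description div, as in the original function
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    return str(soup.find("div", attrs={"class": "section__description"}))

# M is your list of urls
with ThreadPoolExecutor(max_workers=20) as executor:
    description = list(executor.map(web_scrape, M))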

Why does this simple Python 3 XPath web scraping script not work?

I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straightforward, and before I did anything else, I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the list is empty. I've looked at the XPath docs and I've tried various alterations to the XPath, but I get nothing each time.
I don't think I can answer your question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with XPath myself, and wasn't able to get the number selector to work, despite this post. Here are a couple of examples to try:
tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')
or
tree.xpath('//table//tr//td//div//img[@src]')
or
tree.xpath('//img[@src]') # 68 images
The key to this is building up slowly. Find all the images, then find the image wrapped in the tag you are interested in.. etc etc, until you are confident you can find only the images your are interested in.
Note that the [@src] allows us to access the source of that image. Using this post we can now download any/all images we want:
import shutil
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
cool_images = tree.xpath('//a[@target=\'_blank\']//img[@src]')
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg'  # path on disk
r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
I would highly recommend looking at Beautiful Soup. For me, this has helped my amateur web scraping ventures. Have a look at this post for a relevant starting point.
This may not be the answer you are looking for, but hopefully it is a starting point / of some use to you - best of luck!
