I'm trying to pull data from this website: https://vahan.parivahan.gov.in/vahan4dashboard/vahan/view/reportview.xhtml
I've used Beautiful Soup in the past for simple things, like pulling the bestsellers list from Amazon, so I'm a bit familiar with it, but this website is very confusing and I'm looking for suggestions on how to get started here.
Essentially, what I would like to do is loop through all the states in the 'State' filter and, for every state, loop through every RTO in the 'RTO' filter. Based on those selections, I want to download the table data.
I know I've not added any code with this question - I need your help understanding how to get started on this project, since I have no idea how to navigate this website.
Thanks for your help!
EDIT:
This is how I'm getting the data:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://vahan.parivahan.gov.in/vahan4dashboard/vahan/view/reportview.xhtml")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
This is me trying to find the relevant sections of the data, but I'm lost here:
html = list(soup.children)[4]  # the <html> element sat at index 4 of the top-level nodes here
list(html.children)  # eyeballing the structure of its children
data = html.find_all('table')  # look for any tables in the static HTML
print(data[0].prettify())
Not sure where to go from here.
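One possible direction, for anyone attempting this: the page is a JSF application (note the .xhtml), and dashboards like this typically populate their filters and tables via AJAX, so the data you want is unlikely to be in the static HTML that requests downloads. Below is a browser-automation sketch of the state/RTO loop, assuming Selenium; the element IDs 'state_filter' and 'rto_filter' are placeholders that have to be replaced with the real IDs found via the browser's developer tools.
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()
driver.get("https://vahan.parivahan.gov.in/vahan4dashboard/vahan/view/reportview.xhtml")
time.sleep(5)  # crude wait for the initial load; explicit waits are better

# 'state_filter' and 'rto_filter' are placeholder IDs -- look up the real
# ones with the browser's developer tools
n_states = len(Select(driver.find_element(By.ID, "state_filter")).options)
for i in range(n_states):
    # re-find the dropdown on every pass: the AJAX re-render invalidates old handles
    Select(driver.find_element(By.ID, "state_filter")).select_by_index(i)
    time.sleep(3)  # wait for the RTO list to refresh
    n_rtos = len(Select(driver.find_element(By.ID, "rto_filter")).options)
    for j in range(n_rtos):
        Select(driver.find_element(By.ID, "rto_filter")).select_by_index(j)
        time.sleep(3)  # wait for the table to refresh
        soup = BeautifulSoup(driver.page_source, "html.parser")
        table = soup.find("table")  # then parse rows and cells as usual
If the dropdowns turn out not to be native select elements, the same loop structure still applies, just with click-based interactions instead of Select.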
I have mainly used this site to find solutions so far, but I am struggling to find a solution as to why I get different soup objects for the US and UK versions of the same site, even though they are pretty much the same when using inspect element or the developer tools on the websites.
I am in the UK, if that is possibly a factor. When parsing eBay US (.com) I get the desired result with regard to the tag names, but when using eBay UK a lot of the HTML tag names etc. seem to have changed.
The following code is an example of how I create the soup object and find listing elements:
from bs4 import BeautifulSoup
import requests
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xcomputer+keyboard.TRS0&_nkw=computer+keyboard&_sacat=0"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
for listing in soup.findAll('li', {'class': 's-item'}):
    try:
        link = listing.find('a', {'class': 's-item__link'})
        name = listing.find("h3", {"class": "s-item__title"}).get_text()
        price = listing.find("span", {"class": "s-item__price"}).get_text()
        print(link.get('href'))
        print(name)
        print(price + "\n")
    except AttributeError:
        # skip listings that are missing any of the expected elements
        continue
>>>https://www.ebay.com/itm/USB-WIRED-STYLISH-SLIM-QWERTY-KEYBOARD-UK-LAYOUT-FOR-PC-DESKTOP-COMPUTER-LAPTOP/392095538686?epid=2298009317&hash=item5b4ab71dfe:g:Zp0AAOSwowBbZw7U
>>>USB WIRED STYLISH SLIM QWERTY KEYBOARD UK LAYOUT FOR PC DESKTOP COMPUTER LAPTOP
>>>$7.15
So an example of the issue I am having:
If I am using the US site (i.e. if you change the above URL to .com) and want to find the listing titles, I can use findAll('li', {'class': 's-item__title'}) on the soup object.
However, if I am using the UK site (the URL above), I can only find the titles using findAll('li', {'class': 'lvtitle'}). The same goes for retrieving the list of listings: for the US soup object I can simply use 's-item', but this is not the case for the UK soup object.
I'm pretty new to programming so apologies for my poor explanation.
EDIT: The above code has been edited to show a working script. When I run it against eBay US I get the correct result (link, name, and price of each listing); if I run the same script with the eBay UK URL, it returns no results. So it does not seem to be due to a mistake in the script itself; the soup object is different for me, but not for others, it seems.
even though they are pretty much the same when inspecting the HTML on the websites
A programming lesson you learn fairly early: "pretty much the same" != "the same". In software, the difference between a program running and failing can be one char out of a million.
You are using CSS selectors to target various elements on the page. CSS does the styling of the pages. However, what do you notice about the two websites? The styling is very different, and thus at least some of the CSS is different. To a certain level, these are different websites and will need separate ways to scrape them (the difference could be as small as making the target CSS class a variable, or as large as completely separate programs that just share functions).
I am a bit perplexed that you cannot use s-item__title for both; I see it in the CSS of both the US and UK eBay sites. Check that you are doing it properly, perhaps by posting your code (you must post code) in a new question specifically about this.
Companies like eBay are not really pleased with people scraping their websites and probably take measures to defeat such attempts. Changing up the CSS so that scrapers do not have consistent targets is certainly one method they might use to prevent it from occurring.
I recently created a project that fetches data from different websites using BeautifulSoup, and eBay was one of them. I can tell you from experience that fetching data from eBay is a struggle: it behaves in unexpected ways and will give you unexpected results.
One thing you can do is go to that URL, right-click to inspect the page, and look at the HTML layout to see what results you are actually getting and how you can work around them (maybe by changing the query parameters in the URL). I know you have already done that, but the HTML on their page is really big, and there are probably some small differences that you didn't catch. A good idea is to compare the HTML from the US and the UK outputs, as there could be some tag differences between the two; based on the tags in the UK website, you can then adjust your findAll calls.
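If it helps, a quick way to do that comparison is to save both pages to disk and run them through a diff tool. A minimal sketch (the search URLs are simplified versions of the one above):
import requests

headers = {'User-Agent': 'Mozilla/5.0'}  # some sites serve different HTML to bare scripted requests
pages = {
    'us': 'https://www.ebay.com/sch/i.html?_nkw=computer+keyboard',
    'uk': 'https://www.ebay.co.uk/sch/i.html?_nkw=computer+keyboard',
}
for name, url in pages.items():
    res = requests.get(url, headers=headers)
    with open('ebay_' + name + '.html', 'w', encoding='utf-8') as f:
        f.write(res.text)
# now diff the two files, or search each one for 's-item__title' / 'lvtitle'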
Another, more formal way to fetch data is to use the eBay API; here is a quick-start guide for the US site: https://developer.ebay.com/tools/quick-start
I am trying to iterate through symbols for different mutual funds and, using those, scrape some info from their Morningstar profiles. The URL is the following:
https://www.morningstar.com/funds/xnas/ZVGIX/quote.html
In the example above, ZVGIX is the symbol. I have tried using XPath to find the data I need; however, that returns empty lists. The code I used is below:
import requests
from lxml import html

for item in symbols:
    url = 'https://www.morningstar.com/funds/xnas/' + item + '/quote.html'
    page = requests.get(url)
    tree = html.fromstring(page.content)
    totalAssets = tree.xpath('//*[@id="gr_total_asset_wrap"]/span/span/text()')
    print(totalAssets)
According to
Blank List returned when using XPath with Morningstar Key Ratios
and
Web scraping, getting empty list
that is due to the fact that the page content is downloaded in stages. The answer to the first link suggests using Selenium and chromedriver, but that is impractical given the amount of data that I am interested in scraping. The answer to the second suggests there may be a way to load the content with further requests, but it does not explain how to formulate those requests. So, how can I apply that solution to my case?
Edit: The code above returns [], in case that was not clear.
In case anyone else ends up here: eventually I solved my problem by analyzing the network requests made when loading the desired pages. Following those requests led to super simple HTML pages that each held a different part of the original page. So rather than scraping 1 page, I ended up scraping around 5 pages for each fund.
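To make that concrete for anyone following along: open the browser's developer tools, watch the Network tab while the quote page loads, and copy the URLs of the requests that return the data. The sketch below shows the pattern; the endpoint path is a placeholder, since the real URLs have to be read out of the Network tab.
from lxml import html
import requests

symbols = ['ZVGIX']
for item in symbols:
    # placeholder endpoint -- substitute a URL observed in the Network tab
    url = 'https://www.morningstar.com/OBSERVED/PARTIAL/PAGE/' + item
    page = requests.get(url)
    tree = html.fromstring(page.content)
    # the partial pages are small, so a simple xpath is usually enough
    print(tree.xpath('//span/text()'))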
I'm trying to follow the first two links under "Certified Lists" on this site.
https://dph.georgia.gov/wastewater-management
The date in the URL will change depending on when they add a new list.
So, I just want to be able to navigate to the two links based on their text "Septic Tank Installers" and "Septic Tank Pumpers".
I'm not trying to have anyone write code out for me. I just can't find anything online that lets me know which module to use.
Any and all help is appreciated.
For example, I used this to download the file at this URL:
dls = 'https://www.sanantonio.gov/DevServ/CrystalReports/BldgActHDMonticelloPrk.xls'
resp = requests.get(dls)
This can be done using BeautifulSoup library. If you have not installed it, you can do so using
pip install beautifulsoup4
or
python -m pip install beautifulsoup4
Coming back to the question: you can use BeautifulSoup to get the p tag after the h3 tag containing the text "Certified Lists", and then get the first two links inside it.
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://dph.georgia.gov/wastewater-management')
soup = BeautifulSoup(resp.text, 'html.parser')
h3_next_p = soup.find('h3', text='Certified Lists').find_next('p')
for link in h3_next_p.find_all('a')[:2]:
    print(link.get('href'))
Output:
/sites/dph.georgia.gov/files/EnvHealth/Sewage/Contractors/EnvHealthInstallers2019-04-09.pdf
/sites/dph.georgia.gov/files/EnvHealth/Sewage/Contractors/EnvHealthPumpers2019-04-09.pdf
This will return the href exactly as it appears in the page source. Use the code below to build a usable absolute link (note the hrefs already start with a slash, so they are joined onto the bare domain):
print('https://dph.georgia.gov' + link.get('href'))
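A slightly more robust variant is urllib.parse.urljoin, which handles leading and trailing slashes for you:
from urllib.parse import urljoin

print(urljoin('https://dph.georgia.gov/', link.get('href')))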
I'm a chemical engineering student. For a project, I'd like to set up a web scraper in Python that could catch certain attributes of different products. For example, I'd like to go to Target and scrape info such as material, weight, and a photo. So far I've tried using the lxml library; below is the code that I've tried to use, unsuccessfully. I'm not interested in web scraping per se, but in the data that I can collect from these websites to perform my calculations. I also found that I'll probably need a web crawler to point the scraper towards the websites I need. Anyway, is it possible for you to point me to a source that teaches this stuff for dummies? I've looked online but so far nothing has really worked.
from lxml import html
import requests

page = requests.get('https://www.target.com/p/delta-children-skylar-4-in-1-convertible-crib/-/A-52936884#lnk=sametab')
tree = html.fromstring(page.content)
data = tree.xpath('//div[@id="product-attributes"]/text()')
Thanks in advance!!!
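A quick sanity check before debugging the xpath itself: fetch the page and see whether the attribute section is in the raw HTML at all. If it is not (common on sites that render product details with JavaScript), no xpath on the downloaded source will find it, and a browser-automation tool or the site's underlying JSON requests would be needed instead. A minimal check:
import requests

page = requests.get('https://www.target.com/p/delta-children-skylar-4-in-1-convertible-crib/-/A-52936884')
print(page.status_code)
print('product-attributes' in page.text)  # False suggests the section is rendered client-side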
My objective is to write a script that downloads the jpg file linked from a picture wallpaper site. To achieve this, I want to search a webpage for a hyperlink (defined by a variable; here it is screenres, for screen resolution), since this particular website lists different resolutions for the file. After searching the webpage for the hyperlink (example: 1280x800), I want to return the value of that hyperlink as a URL (example: www.testwallpaperwebsite.com/image1280x800.jpg), which I would then pass to other tools to download the jpg. So far, this is the code I have; it returns all of the URL links on the webpage, but I can't seem to find a method that lets me search for the hyperlink and return just that URL:
import urllib.request
import lxml.html

# the hyperlink text I am trying to find on the webpage; the variable can
# change depending on screen size
screenres = '1280x800'

connection = urllib.request.urlopen("http://www.testwallpaperwebsite.com")
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print(link)
I have found many answers on Stack Overflow about retrieving URLs, but nothing that helps me specifically search for a hyperlink and extract its URL. Is there a way to recognize whether a group of characters, like my variable, is a hyperlink in the first place? Granted, I am newer to Python and web scraping, so I am still trying to get a handle on the mechanics of extracting information. Furthermore, if this simple task could be accomplished using just urllib/lxml/other native modules rather than downloading BeautifulSoup, that would be great. However, if there is no other way, or it would be extremely complicated otherwise, I could definitely download and use BeautifulSoup; I just wanted to try and reduce the script's complexity. Thank you for your help!
You should definitely try BeautifulSoup. Check out the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc/. You can find links by using something like:
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all("a", href=re.compile("YOUR REGEX"))
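For the resolution-link use case in the question, the regex can simply be the screenres value itself. A minimal sketch, assuming the placeholder site from the question and that the resolution text appears inside the href:
import re
import urllib.request
from bs4 import BeautifulSoup

screenres = '1280x800'
html_doc = urllib.request.urlopen('http://www.testwallpaperwebsite.com').read()
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a', href=re.compile(re.escape(screenres))):
    print(link.get('href'))  # e.g. the .../image1280x800.jpg URL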