Beautiful soup different output for UK and US sites - python-3.x

I have mainly used this site to find solutions so far, but I am struggling to work out why I get different soup objects for the US and UK versions of the same site, even though the two pages look almost identical when inspected with the browser's developer tools.
I am in the UK, if that could be a factor. When parsing eBay US (.com) I get the desired result with regards to the tag names, but on eBay UK a lot of the HTML tag names and classes seem to have changed.
The following code is an example of how I create the soup object and find listing elements:
from bs4 import BeautifulSoup
import requests

url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xcomputer+keyboard.TRS0&_nkw=computer+keyboard&_sacat=0"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

for listing in soup.find_all('li', {'class': 's-item'}):
    try:
        link = listing.find('a', {'class': 's-item__link'})
        name = listing.find("h3", {"class": "s-item__title"}).get_text()
        price = listing.find("span", {"class": "s-item__price"}).get_text()
        print(link.get('href'))
        print(name)
        print(price + "\n")
    except AttributeError:
        # Skip listings that are missing one of the expected elements
        pass
>>>https://www.ebay.com/itm/USB-WIRED-STYLISH-SLIM-QWERTY-KEYBOARD-UK-LAYOUT-FOR-PC-DESKTOP-COMPUTER-LAPTOP/392095538686?epid=2298009317&hash=item5b4ab71dfe:g:Zp0AAOSwowBbZw7U
>>>USB WIRED STYLISH SLIM QWERTY KEYBOARD UK LAYOUT FOR PC DESKTOP COMPUTER LAPTOP
>>>$7.15
So an example of the issue I am having:
If I use the US site (change the above URL to .com) and want to find the listing titles, I can use findAll('li', {'class': 's-item__title'}) on the soup object.
However, on the UK site (the URL above) I can only find the titles using findAll('li', {'class': 'lvtitle'}). The same applies to retrieving the list of listings: for the US soup object I can simply use 's-item', but this is not the case for the UK soup object.
I'm pretty new to programming so apologies for my poor explanation.
EDIT: The above code has been edited to show a working script. When I run it against eBay US I get the correct result (link, name and price of each listing); running the same script with the eBay UK URL returns no results. So it does not seem to be due to a mistake in the script itself: the soup object is different for me, but apparently not for others.

even though they are pretty much the same when inspecting the HTML on the websites
A programming lesson you learn fairly early: "pretty much the same" is not the same as "the same". In software, the difference between a program running and failing can be one character out of a million.
You are using CSS class names to target various elements on the page, and CSS does the styling of the pages. What do you notice about the two websites (images are attached at the bottom)? The styling is very different, so at least some of the CSS is different. To a certain level these are different websites and will need separate ways to scrape them (it could be as small as making the target CSS classes a variable, or as large as completely separate programs with shared functions).
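The "make the target CSS a variable" idea can be sketched roughly like this. Apart from lvtitle, which the asker reports for the UK site, the UK class names here are illustrative guesses; inspect each site to confirm the real ones:

```python
# Per-site selector tables: one place to change class names per eBay domain.
# The UK entries (other than lvtitle) are illustrative guesses.
SELECTORS = {
    "us": {"item": "s-item", "title": "s-item__title", "price": "s-item__price"},
    "uk": {"item": "sresult", "title": "lvtitle", "price": "lvprice"},
}

def pick_selectors(url):
    # Choose the selector table based on the eBay domain in the URL.
    return SELECTORS["uk"] if ".co.uk" in url else SELECTORS["us"]

print(pick_selectors("https://www.ebay.co.uk/sch/i.html")["title"])  # -> lvtitle
```

The rest of the scraping loop then reads its class names from the chosen table instead of hard-coding them.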
I am a bit perplexed that you cannot use s-item__title for both; I see it in the CSS of both the US and UK eBay sites. Check that you are using it correctly, perhaps by posting your code (you must post code) in a new question specifically about this.
Companies like eBay are not really pleased with people scraping their websites and probably take measures to defeat such attempts. Changing up the CSS so that scrapers do not have consistent targets is certainly one method they might use.

I recently created a project that fetches data from several websites, eBay among them, using BeautifulSoup. I can tell you from experience that fetching data from eBay is a struggle: it behaves in unexpected ways and can give you unexpected results.
One thing you can do is go to that URL, right-click to inspect the page, and study the HTML layout to see what results you are actually getting and how you can work around that (maybe by changing your queries in the URL). I know you have already done this, but the HTML on their pages is really big and there are probably some small differences you didn't catch. A good idea is to compare the HTML from the US and UK outputs, as there could be tag differences between the two; based on the tags in the UK website you can adjust your findAll calls.
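Comparing the two responses can be semi-automated: collect the class names each page uses and diff the sets. A minimal sketch, where the toy HTML stands in for the real res.text of each site:

```python
from bs4 import BeautifulSoup

def class_set(html_text):
    # Collect every CSS class used anywhere in the document.
    soup = BeautifulSoup(html_text, "html.parser")
    classes = set()
    for tag in soup.find_all(True):
        classes.update(tag.get("class", []))
    return classes

# Toy stand-ins for the US and UK responses; in practice pass requests' res.text.
us_html = '<li class="s-item"><h3 class="s-item__title">kb</h3></li>'
uk_html = '<li class="sresult"><h3 class="lvtitle">kb</h3></li>'

# Classes present in the US page but missing from the UK one
print(sorted(class_set(us_html) - class_set(uk_html)))
```

The classes that show up only on one side are exactly the selectors you would need to change per site.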
Another, more formal, way to fetch data is the eBay API; here is a quick-start guide for the US site: https://developer.ebay.com/tools/quick-start

Related

How to scrape hrefs disguised with unicode (e.g \u003ca href=\)

I'm trying to scrape relative paths contained in hrefs, but they aren't showing up in anything but the main soup pull. If I try to pull hrefs or links specifically, the ones I'm looking for just don't show up, even though I know they are there.
\u003ca href=\"/model/ford-1200\"
\u003ca href=\"/model/ford-1300\"
\u003ca href=\"/model/ford-1400\"
Is there a way to create a list of the 20 or so "u003ca href"s on the page? I'm looking for just the part in the quotes (e.g. /model/ford-1200, /model/ford-1300, /model/ford-1400), collected into a list.
Is this a use case where I'm going to need to buckle down and learn JavaScript scraping?
It turns out the issue wasn't solved by addressing the unicode escapes or using re: the content simply hadn't loaded enough to be scraped. I solved this with Selenium, which let me scrape the hrefs as I normally would.
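For anyone who does find the escaped strings present in the raw response text, a plain regex over res.text can capture the quoted paths without decoding the unicode escapes (the sample text is copied from the question; the asker's actual blocker turned out to be content loading, as noted above):

```python
import re

# Raw page text as it appears in the source, with unicode-escaped tags.
raw = r'\u003ca href=\"/model/ford-1200\" \u003ca href=\"/model/ford-1300\" \u003ca href=\"/model/ford-1400\"'

# Match the escaped opening tag and capture everything up to the closing \"
paths = re.findall(r'\\u003ca href=\\"([^"\\]+)', raw)
print(paths)  # ['/model/ford-1200', '/model/ford-1300', '/model/ford-1400']
```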

Web Scraping data from a confusing website

I'm trying to pull data from this website: https://vahan.parivahan.gov.in/vahan4dashboard/vahan/view/reportview.xhtml
I've used Beautiful Soup in the past for simple things like pulling the bestsellers list from Amazon, so I'm somewhat familiar with it, but this website is very confusing and I'm looking for suggestions on how to get started.
Essentially I would like to loop through all the states in the 'State' filter and, for every state, loop through every RTO in the 'RTO' filter, then download the table data for each combination.
I know I've not added any code with this question; I need your help understanding how to get started on this project, since I have no idea how to navigate this website.
Thanks for your help!
EDIT:
This is how I'm getting the data:
from bs4 import BeautifulSoup
import requests
page = requests.get("https://vahan.parivahan.gov.in/vahan4dashboard/vahan/view/reportview.xhtml")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
This is me currently trying to find the relevant sections of the data but I'm lost here:
html = list(soup.children)[4]
list(html.children)
data = html.find_all('table')
print(data[0].prettify())
Not sure where to go from here.
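A first step could be to list the dropdowns and their options; if they appear in the initial HTML, their values are what you would loop over. A sketch on a toy sample (the real page is a JSF application, so its element ids will differ, and posting a selection back will likely also require hidden form fields such as javax.faces.ViewState, or a browser-automation tool like Selenium):

```python
from bs4 import BeautifulSoup

# Toy stand-in for the page HTML; on the real site, build soup from page.content.
sample = """
<select id="stateSelect">
  <option value="">All States</option>
  <option value="DL">Delhi</option>
  <option value="MH">Maharashtra</option>
</select>
"""
soup = BeautifulSoup(sample, "html.parser")

# Enumerate every dropdown and its option values - the values you would loop over.
for select in soup.find_all("select"):
    options = [(o.get("value"), o.get_text(strip=True)) for o in select.find_all("option")]
    print(select.get("id"), options)
```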

Getting empty list when web scraping morningstar

I am trying to iterate through symbols for different mutual funds, and using those scrape some info from their Morningstar profiles. The URL is the following:
https://www.morningstar.com/funds/xnas/ZVGIX/quote.html
In the example above, ZVGIX is the symbol. I have tried using xpath to find the data I need, however that returns empty lists. The code I used is below:
from lxml import html
import requests

for item in symbols:
    url = 'https://www.morningstar.com/funds/xnas/' + item + '/quote.html'
    page = requests.get(url)
    tree = html.fromstring(page.content)
    # XPath attribute syntax is @id; the CSS-style #id copied from dev tools is invalid
    totalAssets = tree.xpath('//*[@id="gr_total_asset_wrap"]/span/span/text()')
    print(totalAssets)
According to
Blank List returned when using XPath with Morningstar Key Ratios
and
Web scraping, getting empty list
that is due to the fact that the page content is downloaded in stages. The answer to the first link suggests using Selenium and ChromeDriver, but that is impractical given the amount of data I am interested in scraping. The answer to the second suggests there may be a way to load the content with further requests, but it does not explain how to formulate those requests. So, how can I apply that solution to my case?
Edit: The code above returns [], in case that was not clear.
In case anyone else ends up here: I eventually solved my problem by analyzing the network requests made while loading the desired pages. Following those requests led to very simple HTML pages that held different parts of the original page. So rather than scraping from one page, I ended up scraping from around five pages for each fund.
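The same pattern, sketched: find the component-page URLs in the browser's network tab, fetch each small page with requests, and parse it. The sample HTML and element id below are illustrative stand-ins, not Morningstar's actual markup:

```python
from lxml import html

def extract_total_assets(page_html):
    # XPath attribute tests use @id; the CSS-style #id is not valid XPath.
    vals = html.fromstring(page_html).xpath('//*[@id="gr_total_asset_wrap"]/span/span/text()')
    return vals[0].strip() if vals else None

# Offline stand-in for one of the small component pages; real markup will differ.
sample = '<div id="gr_total_asset_wrap"><span><span> 1.2 bil </span></span></div>'
print(extract_total_assets(sample))  # -> 1.2 bil
```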

webscraping of commercial websites in python

I'm a chemical engineering student. For a project, I'd like to set up a web scraper in Python that can capture certain attributes of different products. For example, I'd like to go to Target and scrape info such as material, weight and a photo. So far I've tried the lxml library; below is the code I've used unsuccessfully. I'm not interested in web scraping per se, but in the data I can collect from these websites to perform my calculations. I've also found that I'll probably need a web crawler to point the scraper toward the websites I need. Anyway, is it possible for you to point me to a source that teaches this stuff for dummies? I've looked online, but so far nothing has really worked.
from lxml import html
import requests

page = requests.get('https://www.target.com/p/delta-children-skylar-4-in-1-convertible-crib/-/A-52936884#lnk=sametab')
tree = html.fromstring(page.content)
# XPath attribute syntax is @id; the CSS-style #id copied from dev tools is invalid
data = tree.xpath('//div[@id="product-attributes"]/text()')
Thanks in advance!!!
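One thing worth checking before reaching for heavier tools: many retail product pages embed their product attributes as JSON-LD structured data in a script tag, which is far easier to parse than the rendered HTML. A sketch on a made-up sample (this is not Target's actual markup; view the real page source to see whether such a tag exists and what fields it carries):

```python
import json
from bs4 import BeautifulSoup

# Made-up sample; inspect the real page source for script tags of this type.
sample = """<html><head><script type="application/ld+json">
{"@type": "Product", "name": "Convertible Crib", "material": "wood", "weight": "56 lb"}
</script></head></html>"""

soup = BeautifulSoup(sample, "html.parser")
for script in soup.find_all("script", type="application/ld+json"):
    product = json.loads(script.string)
    print(product.get("name"), product.get("material"), product.get("weight"))
```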

can you have "variables" in text in google sites?

Sorry, this is a bad question. I don't even know what the title should be. I'm a total noob at making websites, so this might be easy to find, but I just don't know the terminology to search for; I cannot find anything about how to do this.
What I want is something like references/variables that I can use in a block of text, which automatically get replaced with whatever value should be there. The best way I can describe it: if I were using the site as a design doc for a game, I would be able to type [Title] (or something similar) on any page, and when the page loads that text would be replaced with whatever my Title is. That way, if I ever change titles, names, classes, races, places, items, etc., they would only have to be changed in one place and the change would be reflected everywhere.
I notice if I add a link to a page it will automatically use the Title of that page as the text of the link. That is almost exactly what I want. Except when I change the Title of the other page the text of the link remains as the original text. It doesn't get updated to the new Title and that is not at all what I want.
Also, I want to do this in Google Sites and as simply as possible. I don't really want to use a database. I was hoping Google Sites would have some kind of functionality for this.
I don't believe this is possible on Google Sites; you likely need to consider a hosted solution.
Quoting the answer from this relevant post:
You should consider hosting your solution using Google's App Engine instead of Google Sites. You can set it up so it uses PHP (see link below), you can configure it to use your domain name, and you get enough CPU, disk and bandwidth allowance to serve around five million page views for free each month; if you are serving more than that, their prices are extremely competitive.
Google App Engine: http://code.google.com/appengine/docs/whatisgoogleappengine.html
How to set up PHP using Google App Engine: http://blog.caucho.com/?p=187
Also, I'm not sure what your PHP skills are like, but if you're unfamiliar with it, this should help get you started.
