Getting empty list when web scraping morningstar - python-3.x

I am trying to iterate through symbols for different mutual funds and use each one to scrape some info from its Morningstar profile. The URL has the following form:
https://www.morningstar.com/funds/xnas/ZVGIX/quote.html
In the example above, ZVGIX is the symbol. I have tried using XPath to find the data I need, but that returns empty lists. The code I used is below:
import requests
from lxml import html

for item in symbols:
    url = 'https://www.morningstar.com/funds/xnas/' + item + '/quote.html'
    page = requests.get(url)
    tree = html.fromstring(page.content)
    totalAssets = tree.xpath('//*[@id="gr_total_asset_wrap"]/span/span/text()')
    print(totalAssets)
According to
Blank List returned when using XPath with Morningstar Key Ratios
and
Web scraping, getting empty list
that is due to the fact that the page content is downloaded in stages. The answer to the first link suggests using Selenium and ChromeDriver, but that is impractical given the amount of data I am interested in scraping. The answer to the second suggests there may be a way to load the content with further requests, but it does not explain how one might formulate those requests. So, how can I apply that solution to my case?
Edit: The code above returns [], in case that was not clear.

In case anyone else ends up here: I eventually solved my problem by analyzing the network requests made while loading the desired pages (the Network tab in the browser's developer tools). Following those requests led to very simple HTML pages that each held a different part of the original page. So rather than scraping one page, I ended up scraping around five pages per fund.
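As a rough illustration of that approach (the component URL below is a hypothetical placeholder; the real endpoints have to be read out of the browser's Network tab for each page):
import requests
from lxml import html

# Hypothetical component URL observed in the Network tab; substitute the
# actual request URLs you see when the Morningstar quote page loads.
component_url = 'https://www.morningstar.com/some/component/path?t=XNAS:{}'

for item in symbols:
    page = requests.get(component_url.format(item))
    tree = html.fromstring(page.content)
    # Each fragment is plain HTML, so the original XPath works on it directly.
    totalAssets = tree.xpath('//*[@id="gr_total_asset_wrap"]/span/span/text()')
    print(item, totalAssets)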

Related

Iterating through different fieldsets in a form using Selenium Python

I am trying to extract the titles of videos saved on my web server. They are saved under a form, and the form has a varying number of fieldsets depending on how many videos there are. I am looking to iterate through all of the available fieldsets and extract the title of each video, which I've highlighted in the attached picture. I don't know how many fieldsets there will be, so I was thinking of a loop over len(fieldsets); however, my unfamiliarity with web scraping/dev has left me a little confused.
I've tried a few different things such as:
results = driver.find_elements(By.XPATH, "/html/body/div[2]/form/fieldset[1]")
However, I am unsure how to extract the attributes of the sub-elements via Selenium.
Thank you
[Image: my web server's HTML page]
If you have many videos, as shown in the images, and you need all of them, then you have to iterate through all the videos with the code below.
all_videos_title = []
all_videos = driver.find_elements_by_xpath('//fieldset[@class="fileicon"]')
for sl_video in all_videos:
    # The title lives on the anchor's "title" attribute, so find the <a>
    # inside this fieldset and read the attribute from it.
    sl_title = sl_video.find_element_by_xpath('.//a').get_attribute('title')
    all_videos_title.append(sl_title)
print(all_videos_title)
If there are other anchor tags "a" inside the "fieldset" tag, then the XPath in your loop changes. So if it's not working, share the whole page source and I'll provide exact details for the data you need.
Please try the XPath below:
//form//fieldset//a/@title
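Note that the find_element_by_* helpers used above were removed in Selenium 4, so on a current Selenium the same answer would look roughly like this:
from selenium.webdriver.common.by import By

all_videos_title = []
for sl_video in driver.find_elements(By.XPATH, '//fieldset[@class="fileicon"]'):
    # Same idea: locate the anchor inside the fieldset, then read its title attribute.
    anchor = sl_video.find_element(By.XPATH, './/a')
    all_videos_title.append(anchor.get_attribute('title'))
print(all_videos_title)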

Selenium Python and Beautiful Soup not finding the correct element with href?

I am working on a web scraper that navigates to a page with a list of links, where I use Beautiful Soup to retrieve each href. However, when I use Selenium again to try to locate those elements and click them, it is not able to find them.
This is what I am trying, with the href stored in the variable link:
a = self.drive.find_element(By.XPATH, "//a[@href=link]")
Then I tried with the contains() method:
a = self.drive.find_element(By.XPATH, "//*[contains(@href, link)]")
But now, the seven different links always point to the same element. The links are long and only differ by a number. Does that affect how contains() works? For instance:
...1953102711/?refId=3fa3c155-c1ed-4322-9390-c9f16320dc76&trk=flagship3_search_srp_jobs
...1981395917/?refId=3fa3c155-c1ed-4322-9390-c9f16320dc76&trk=flagship3_search_srp_jobs
What can I do to either find the element using an exact search, or avoid the repetition when using contains?
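One likely root cause, for anyone landing here: in both attempts, link appears inside the XPath string literally, so Selenium never sees the variable's value. In the contains() version, the bare link is evaluated as a (missing) child element, whose string value is '', and contains(@href, '') is true for every node, so find_element always returns the first match on the page. A sketch of interpolating the variable instead, where job_id is a hypothetical unique fragment extracted from each href:
from selenium.webdriver.common.by import By

# Exact match: interpolate the URL into the XPath; the quotes around {link} matter.
a = self.drive.find_element(By.XPATH, f"//a[@href='{link}']")

# Or match only the part that is unique to each link (e.g. the numeric ID),
# rather than the refId/trk query string the links share.
a = self.drive.find_element(By.XPATH, f"//a[contains(@href, '{job_id}')]")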

Python selenium "Timeout Exception" error

I am trying to click on the Financials link of the following URL using Selenium and Python.
https://www.marketscreener.com/DOLLAR-GENERAL-CORPORATIO-5699818/
Initially, I used the following code:
link = driver.find_element_by_link_text('Financials')
link.click()
Sometimes this works and sometimes it doesn't, and I get an "Element is not clickable at point (X, Y)" error. I have added code to maximise the window in case the link was getting overlapped by something.
It seems that the error is because the webpage doesn't always load in time. To overcome this I have been trying to use expected conditions and wait libraries of Selenium.
I came up with the following, but it isn't working; I just get a TimeoutException.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait

link = wait(driver, 60).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="zbCenter"]/div/span/table[3]/tbody/tr/td/table/tbody/tr/td[8]/nobr/a/b')))
link.click()
I think XPath is probably the best choice here, or perhaps class name, as there is no ID. I'm not sure if it isn't working because the link is inside a table, but it seems odd to me that sometimes it works without having to wait at all.
I have tried Jacob's approach. The problem is that I want it to be dynamic, so it will work for other companies. Also, when I first land on the summary page, the URL has other things at the end, so I can't just append /financials to it.
This is the URL it gives me: https://www.marketscreener.com/DOLLAR-GENERAL-CORPORATIO-5699818/?type_recherche=rapide&mots=DG
I might have found a way around this:
link = driver.current_url.split('?')[0]
How do I then access this item and append the string 'financials/' to it?
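Since split('?')[0] already gives back a plain string (everything before the '?'), appending is just string concatenation; a small sketch:
# current_url: https://www.marketscreener.com/DOLLAR-GENERAL-CORPORATIO-5699818/?type_recherche=rapide&mots=DG
base = driver.current_url.split('?')[0]  # ends with a trailing '/'
driver.get(base + 'financials/')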
I was looking for a solution when I noticed that clicking on the Financials tab takes you to a new URL. In this case, I think the simplest solution is just to use the .get() method with that URL.
i.e.
driver.get('https://www.marketscreener.com/DOLLAR-GENERAL-CORPORATIO-5699818/financials/')
This will always take you directly to the financial page! Hope this helps.

Beautiful soup different output for UK and US sites

I have mainly used this site to find solutions so far; however, I am struggling to work out why I get different soup objects for the US and UK versions of the same site, even though they look pretty much the same when using inspect element or the developer tools on the websites.
I am in the UK, if that is possibly a factor. When parsing eBay US (.com) I get the desired result with regard to the tag names, but when using eBay UK a lot of the HTML tag names etc. seem to have changed.
The following code is an example of how I create the soup object and find listing elements:
from bs4 import BeautifulSoup
import requests
url = "https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xcomputer+keyboard.TRS0&_nkw=computer+keyboard&_sacat=0"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
for listing in soup.findAll('li', {'class': 's-item'}):
    try:
        link = listing.find('a', {'class': 's-item__link'})
        name = listing.find("h3", {"class": "s-item__title"}).get_text()
        price = listing.find("span", {"class": "s-item__price"}).get_text()
        print(link.get('href'))
        print(name)
        print(price + "\n")
    except:
        pass
>>>https://www.ebay.com/itm/USB-WIRED-STYLISH-SLIM-QWERTY-KEYBOARD-UK-LAYOUT-FOR-PC-DESKTOP-COMPUTER-LAPTOP/392095538686?epid=2298009317&hash=item5b4ab71dfe:g:Zp0AAOSwowBbZw7U
>>>USB WIRED STYLISH SLIM QWERTY KEYBOARD UK LAYOUT FOR PC DESKTOP COMPUTER LAPTOP
>>>$7.15
So an example of the issue I am having:
If I am using the US site (change the above URL to .com) and want to find the listing titles, I can use findAll('li', {'class': 's-item__title'}) on the soup object.
However, if I am using the UK site (the URL above), I can only find the titles using findAll('li', {'class': 'lvtitle'}). The same applies when retrieving the list of listings: for the US soup object I can simply use 's-item', but this is not the case for the UK soup object.
I'm pretty new to programming so apologies for my poor explanation.
EDIT: The above code has been edited to show a working script. When I run it against eBay US I get the correct result (link, name, and price of each listing); if I run the same script with the eBay UK URL it returns no results. So it does not seem to be due to a mistake in the script itself; the soup object is different for me, but not for others, it seems.
even though they are pretty much the same when inspecting the HTML on the websites
A programming lesson that you learn fairly early: pretty much the same != the same. In software, the difference between a program running and failing can be one char out of a million.
You are using CSS selectors to target various elements on the page. CSS does the styling of the pages. However, what do you notice about the websites (images are attached at the bottom)? The styling is very different, and thus at least some of the CSS is different. To a certain level, these are different websites and will need separate ways to scrape them (it could be as small as making the target CSS a variable, or as large as completely separate programs with shared functions).
I am a bit perplexed that you cannot use s-item__title for both. I see it in the CSS of both the USA and UK eBay sites. Check that you are doing it properly, perhaps by posting your code (you must post code) in a new question specifically asking about this.
Companies like eBay are not really pleased with people scraping their websites and probably take measures to defeat such attempts. Changing up the CSS so that scrapers do not have consistent targets is certainly one method they might use to prevent it from occurring.
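A minimal sketch of the "make the target CSS a variable" idea, assuming the class names from the question (the UK listing-container class 'sresult' is a guess; verify all of these against the live markup, which eBay changes over time):
from bs4 import BeautifulSoup
import requests

# Per-site selectors as (tag, class) pairs. 's-item' and 's-item__title' come
# from the question; 'lvtitle' does too, but 'sresult' is an assumption.
SELECTORS = {
    'com':   {'listing': ('li', 's-item'),  'title': ('h3', 's-item__title')},
    'co.uk': {'listing': ('li', 'sresult'), 'title': ('h3', 'lvtitle')},
}

def listing_titles(url, site):
    sel = SELECTORS[site]
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    tag, cls = sel['listing']
    for listing in soup.find_all(tag, class_=cls):
        title_tag, title_cls = sel['title']
        title = listing.find(title_tag, class_=title_cls)
        if title:
            yield title.get_text()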
I recently created a project to fetch data from different websites using BeautifulSoup, and eBay was one of them. I can tell you from experience that fetching data from eBay is a struggle; it behaves in unexpected ways and will give you unexpected results.
One thing you can do is go to that URL, right-click to inspect the page, and look at the HTML layout to see the results you are getting and how you can work around them (maybe by changing the queries in the URL). I know you have already done that, but the HTML of their web page is really big and there are probably some small differences that you didn't catch. A good idea might be to compare the HTML from the US and UK outputs, as there could be some tag differences between the two; based on the tags on the UK website, you can change your findAll method.
Another (more formal) way to fetch data is by using the eBay API; here is the link for a quick-start guide for the US website: https://developer.ebay.com/tools/quick-start

Scrapy not extracting data from a certain xpath

I'm trying to extract some data from an Amazon product page. What I'm looking for is the product images. For example:
https://www.amazon.com/gp/product/B072L7PVNQ?pf_rd_p=1581d9f4-062f-453c-b69e-0f3e00ba2652&pf_rd_r=48QP07X56PTH002QVCPM&th=1&psc=1
By using the XPath
//script[contains(., "ImageBlockATF")]/text()
I get the part of the source code that contains the URLs, but two results pop up in the Chrome XPath helper.
By trying things out with XPaths I ended up using this:
//*[contains(@type, "text/javascript") and contains(., "ImageBlockATF") and not(contains(., "jQuery"))]
Which gives me exclusively the data I need.
The problem I'm having is that for certain products (it can happen between two pairs of different shoes) sometimes I can extract the data and other times nothing comes out. I extract by doing:
imagenesString = response.xpath('//*[contains(@type, "text/javascript") and contains(., "ImageBlockATF") and not(contains(., "jQuery"))]').extract()
If I use the Chrome XPath helper, the data always appears with the XPath above, but in the program itself it sometimes appears and sometimes doesn't. I know the script that the console reads can differ from the one that appears on the site, but I'm struggling with this one because it works intermittently. Any ideas on what could be going on?
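As a side note, when the script text does come back, the image URLs can usually be pulled out of it with a regex along these lines (the "hiRes" key is an assumption about the shape of Amazon's data blob; inspect the script text for your product to confirm it):
import re

# imagenesString is the list returned by the extract() call above.
script_text = imagenesString[0] if imagenesString else ''
image_urls = re.findall(r'"hiRes":"(https://[^"]+)"', script_text)
print(image_urls)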
I think I found your problem: it's a captcha.
Follow these steps to reproduce:
1. Run scrapy shell (the URL is quoted so the shell doesn't interpret the &):
scrapy shell "https://www.amazon.com/gp/product/B072L7PVNQ?pf_rd_p=1581d9f4-062f-453c-b69e-0f3e00ba2652&pf_rd_r=48QP07X56PTH002QVCPM&th=1&psc=1"
2. View the response as Scrapy sees it:
view(response)
When executing this I sometimes got a captcha.
Hope this points you in the right direction.
Cheers
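If a captcha is indeed the cause, one mitigation sketch is to detect the interstitial in the spider and re-queue the request (the 'captcha' marker string is an assumption; check what the blocked response actually contains):
def parse(self, response):
    # Assumed marker: the captcha interstitial mentions "captcha" in its markup.
    if b'captcha' in response.body.lower():
        # Retry the same URL; dont_filter bypasses Scrapy's duplicate filter.
        yield response.request.replace(dont_filter=True)
        return
    imagenesString = response.xpath(
        '//*[contains(@type, "text/javascript") and contains(., "ImageBlockATF")'
        ' and not(contains(., "jQuery"))]'
    ).extract()
    yield {'scripts': imagenesString}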
