I am trying to extract the titles of videos saved on my webserver. They're listed inside a form, and the form has a varying number of fieldsets depending on how many videos there are. I am looking to iterate through all of the available fieldsets and extract the title of each video, which I've highlighted in the attached picture. I don't know how many fieldsets there will be, so I was thinking of a loop over len(fieldsets); however, my unfamiliarity with web scraping/dev has left me a little confused.
I've tried a few different things such as:
results = driver.find_elements(By.XPATH, "/html/body/div[2]/form/fieldset[1]")
However, I am unsure how to extract the attributes of the sub-elements via Selenium.
Thank you
[Screenshot: my webserver's HTML page]
If you have many videos, as shown in the images, and you need all of them, then you have to iterate through all the videos using the code given below.
all_videos_title = []
all_videos = driver.find_elements_by_xpath('//fieldset[@class="fileicon"]')
for sl_video in all_videos:
    # Use a relative XPath (leading dot) so the search stays inside this
    # fieldset, then read the attribute with get_attribute().
    sl_title = sl_video.find_element_by_xpath(".//a").get_attribute("title")
    all_videos_title.append(sl_title)
print(all_videos_title)
If there is any other anchor tag "a" inside the "fieldset" tag, then the XPath in your loop changes. So if it's not working, you will have to share the whole page source and I'll provide exact details for the data you need.
Please try the XPath below (the fieldset is a descendant of the form, and the anchor a descendant of the fieldset):
//form//fieldset//a/@title
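Keep in mind that Selenium's find_elements cannot return attribute nodes such as @title, so a practical variant, a minimal sketch assuming the titles sit on the anchors as above, locates the elements first and reads the attribute in Python:
# Locate the anchors, then pull the title attribute from each one.
anchors = driver.find_elements_by_xpath("//form//fieldset//a")
titles = [a.get_attribute("title") for a in anchors]
print(titles)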
Related
I am working on a web scraper that navigates to a page with a list of links, then uses Beautiful Soup to retrieve each href. However, when I use Selenium again to try to locate those elements and click them, it is not able to find them.
What I am trying with is this. My href being stored in the variable link:
a = self.drive.find_element(By.XPATH, "//a[@href=link]")
Then I tried with the method contains
a = self.drive.find_element(By.XPATH, "//*[contains(@href, link)]")
But now the seven different links always point to the same element. The links are long and only differ from each other by a number. Does that affect how the contains method works? For instance:
...1953102711/?refId=3fa3c155-c1ed-4322-9390-c9f16320dc76&trk=flagship3_search_srp_jobs
...1981395917/?refId=3fa3c155-c1ed-4322-9390-c9f16320dc76&trk=flagship3_search_srp_jobs
What can I do to either, find the element by using the exact search, or avoid repetition using contains?
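One likely culprit, for what it's worth: in both snippets, link inside the XPath string is parsed as an XPath expression (a child element named link), not as your Python variable. That node-set is empty, it converts to an empty string, and contains(@href, "") is true for every anchor, so find_element always returns the first one. A minimal sketch of interpolating the variable's value instead, assuming the href contains no double quotes:
# Build the XPath in Python so the variable's value ends up in the query.
a = self.drive.find_element(By.XPATH, f'//a[@href="{link}"]')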
I am trying to iterate through the symbols of different mutual funds and, using those, scrape some info from their Morningstar profiles. The URL is the following:
https://www.morningstar.com/funds/xnas/ZVGIX/quote.html
In the example above, ZVGIX is the symbol. I have tried using XPath to find the data I need; however, that returns empty lists. The code I used is below:
import requests
from lxml import html

for item in symbols:
    url = 'https://www.morningstar.com/funds/xnas/' + item + '/quote.html'
    page = requests.get(url)
    tree = html.fromstring(page.content)
    totalAssets = tree.xpath('//*[@id="gr_total_asset_wrap"]/span/span/text()')
    print(totalAssets)
According to
Blank List returned when using XPath with Morningstar Key Ratios
and
Web scraping, getting empty list
that is due to the fact that the page content is downloaded in stages. The answer to the first link suggests using Selenium and ChromeDriver, but that is impractical given the amount of data I am interested in scraping. The answer to the second suggests there may be a way to load the content with further requests, but it does not explain how to formulate those requests. So, how can I apply that solution to my case?
Edit: The code above returns [], in case that was not clear.
In case anyone else ends up here: I eventually solved my problem by analyzing the network requests made while loading the desired pages. Following those links led to super simple HTML pages that held different parts of the original page. So rather than scraping from one page, I ended up scraping from around five pages for each fund.
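A minimal sketch of that approach; the fragment URLs below are placeholders rather than Morningstar's real endpoints, so substitute whatever the browser's Network tab actually shows:
import requests
from lxml import html

# Hypothetical fragment endpoints discovered via the DevTools Network tab.
FRAGMENTS = [
    "https://tools.morningstar.example/funds/{symbol}/quote-header",
    "https://tools.morningstar.example/funds/{symbol}/quote-body",
]

def scrape_fund(symbol):
    results = {}
    for template in FRAGMENTS:
        page = requests.get(template.format(symbol=symbol))
        # Each fragment is plain HTML, so ordinary XPath works again.
        tree = html.fromstring(page.content)
        results[template] = tree.xpath('//span/text()')
    return results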
I'm trying to extract some data from an amazon product page.
What I'm looking for is getting the images from the products. For example:
https://www.amazon.com/gp/product/B072L7PVNQ?pf_rd_p=1581d9f4-062f-453c-b69e-0f3e00ba2652&pf_rd_r=48QP07X56PTH002QVCPM&th=1&psc=1
By using the XPath
//script[contains(., "ImageBlockATF")]/text()
I get the part of the source code that contains the URLs, but two options pop up in the Chrome XPath Helper.
By trying things out with XPaths I ended up using this:
//*[contains(@type, "text/javascript") and contains(., "ImageBlockATF") and not(contains(., "jQuery"))]
Which gives me exclusively the data I need.
The problem I'm having is that for certain products (it can happen between two pairs of different shoes) sometimes I can extract the data and other times nothing comes out. I extract by doing:
imagenesString = response.xpath('//*[contains(@type, "text/javascript") and contains(., "ImageBlockATF") and not(contains(., "jQuery"))]').extract()
If I use the Chrome XPath Helper, the data always appears with the XPath above, but in the program itself it sometimes appears and sometimes doesn't. I know the script the console reads is sometimes different from the one that appears on the site, but I'm struggling with this one because it works only intermittently. Any ideas on what could be going on?
I think I found your problem: it's a captcha.
Follow these steps to reproduce:
1. run scrapy shell
scrapy shell "https://www.amazon.com/gp/product/B072L7PVNQ?pf_rd_p=1581d9f4-062f-453c-b69e-0f3e00ba2652&pf_rd_r=48QP07X56PTH002QVCPM&th=1&psc=1"
(the quotes keep the shell from interpreting the & characters)
2. view response like scrapy
view(response)
When executing this I sometimes got a captcha.
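A minimal sketch of guarding against this inside the spider; the "captcha" marker string is an assumption about Amazon's interstitial page, not a documented contract:
def parse(self, response):
    # If the captcha interstitial came back instead of the product page,
    # re-queue the request (dont_filter bypasses the duplicate filter).
    if b"captcha" in response.body.lower():
        yield response.request.replace(dont_filter=True)
        return
    imagenesString = response.xpath(
        '//*[contains(@type, "text/javascript") and '
        'contains(., "ImageBlockATF") and not(contains(., "jQuery"))]'
    ).extract()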
Hope this points you in the right direction.
Cheers
I have a task to parse the links and titles of YouTube videos from one category (e.g. music).
I know there's a huge number of videos, hence my question: how do I do it using, for example, Node.js?
I have only one idea: simply use PhantomJS and keep scrolling down the page to get as many videos as I can, but this solution is dumb.
Are there any other solutions, using YouTube API for example or other tools and methods?
https://www.googleapis.com/youtube/v3/search?part=snippet&type=video&videoCategoryId=10&key=Your Key Here
Try using this. First, find out which video category ID you need.
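A minimal sketch of paging through that endpoint (shown in Python; the same flow maps directly to Node.js), assuming you have a valid API key:
import requests

API_URL = "https://www.googleapis.com/youtube/v3/search"

def fetch_category(category_id, api_key, max_pages=5):
    videos, page_token = [], None
    for _ in range(max_pages):
        params = {
            "part": "snippet",
            "type": "video",
            "videoCategoryId": category_id,
            "maxResults": 50,  # the API's per-page maximum
            "key": api_key,
        }
        if page_token:
            params["pageToken"] = page_token
        data = requests.get(API_URL, params=params).json()
        for item in data.get("items", []):
            videos.append({
                "title": item["snippet"]["title"],
                "url": "https://www.youtube.com/watch?v=" + item["id"]["videoId"],
            })
        # nextPageToken disappears on the last page.
        page_token = data.get("nextPageToken")
        if not page_token:
            break
    return videos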
Currently Google displays, in the result excerpts, elements that belong to the functional part of the site. Is there a way to exclude these elements from being crawled/displayed in Google?
Like eEdit, eDelete, etc. in the example above.
To exclude pages from Google's index, block them using the robots.txt file; if it is just certain content, use the rel="nofollow" attribute on the relevant links.
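For example, a minimal robots.txt sketch; the /edit and /delete paths are placeholders for whatever URLs back those controls:
User-agent: *
Disallow: /edit
Disallow: /delete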
Hope this helps.
Update on my particular situation here: I just found out that the frontend code had been generated in a way where the title and the description meta were identical.
Google is smart enough to figure that, if a piece of copy is already displayed in the title of the search result, there's no reason to add it to the excerpt as well; instead it looks for content, believed to be valuable, from the actual page.
Lessons learned:
there's no way to hide elements from Google but keep them visible to your users
if you'd like to have control over the content displayed in Google searches, avoid using the same copy in your title and description