How to get all the links on a page using Selenium with Python? - python-3.x

I have tried the code given below, but every time I run it some links are missing. I want to get all the links on the page into a list, so that I can go to any link I want using slicing.
links = []
eles = driver.find_elements_by_xpath("//*[@href]")
for elem in eles:
    url = elem.get_attribute('href')
    print(url)
    links.append(url)
Is there any way to get all the elements without missing any?

Sometimes the links reside inside frames.
Search for frames on the page using the browser's inspect tool.
If you find any, you need to switch to the frame first:
browser.switch_to.frame("x1")
links = []
eles = browser.find_elements_by_xpath("//*[@href]")
for elem in eles:
    url = elem.get_attribute('href')
    print(url)
    links.append(url)
browser.switch_to.default_content()
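If there are several frames and you don't know their names, a rough sketch like the one below walks the top-level document and every iframe; the collect_hrefs helper and switching by frame index are my additions, not part of the original answer:
links = []

def collect_hrefs(ctx):
    # collect the href attribute of every element that has one
    for elem in ctx.find_elements_by_xpath("//*[@href]"):
        links.append(elem.get_attribute('href'))

collect_hrefs(browser)  # links in the top-level document

frames = browser.find_elements_by_tag_name("iframe")
for index in range(len(frames)):
    browser.switch_to.frame(index)       # switch by index to avoid stale element references
    collect_hrefs(browser)
    browser.switch_to.default_content()  # back to the top-level document before the next frame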

Related

How to collect URL links for pages that are not numerically ordered

When URLs are numbered sequentially, it's simple to fetch all the articles on a given website.
However, when we have a website such as https://mongolia.mid.ru/en_US/novosti where there are articles with URLs like
https://mongolia.mid.ru/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/10-iula-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-i-ministra-inostrannyh-del-mongolii-n-enhtajv?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1
How do I fetch all the article URLs on this website, where there's no numeric order whatsoever?
There's order to that chaos.
If you take a good look at the source code, you'll notice the next button. If you click it and inspect the URL (it's long, I know) you'll see there's a value at the very end of it - _cur=1. This is the number of the page you're currently on.
The problem, however, is that you don't know how many pages there are, right? But you can programmatically keep checking the URL in the next button and stop when there are no more pages to go to.
Meanwhile, you can scrape the article URLs while you're on the current page.
Here's how to do it:
import requests
from lxml import html

url = "https://mongolia.mid.ru/en_US/novosti"
next_page_xpath = '//*[@class="pager lfr-pagination-buttons"]/li[2]/a/@href'
article_xpath = '//*[@class="title"]/a/@href'

def get_page(url):
    return requests.get(url).content

def extractor(page, xpath):
    return html.fromstring(page).xpath(xpath)

def head_option(values):
    return next(iter(values), None)

articles = []
while True:
    page = get_page(url)
    print(f"Checking page: {url}")
    articles.extend(extractor(page, article_xpath))
    next_page = head_option(extractor(page, next_page_xpath))
    if next_page == 'javascript:;':
        break
    url = next_page

print(f"Scraped {len(articles)}.")
# print(articles)
This gets you 216 article URLs. If you want to see them, just uncomment the last line, # print(articles).
Here's a sample of 2:
['https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/24-avgusta-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-ministrom-energetiki-mongolii-n-tavinbeh?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1', 'https://mongolia.mid.ru:443/en_US/novosti/-/asset_publisher/hfCjAfLBKGW0/content/19-avgusta-2020-goda-sostoalas-vstreca-crezvycajnogo-i-polnomocnogo-posla-rossijskoj-federacii-v-mongolii-i-k-azizova-s-zamestitelem-ministra-inostran?inheritRedirect=false&redirect=https%3A%2F%2Fmongolia.mid.ru%3A443%2Fen_US%2Fnovosti%3Fp_p_id%3D101_INSTANCE_hfCjAfLBKGW0%26p_p_lifecycle%3D0%26p_p_state%3Dnormal%26p_p_mode%3Dview%26p_p_col_id%3Dcolumn-1%26p_p_col_count%3D1']

Scrape all youtube search results

I am trying to collect data from YouTube search results. The search term is "border collie" with a filter for videos that were uploaded "Today".
52 videos appear in the search results. However, when I try to parse the page, I only get 20 entries. How do I parse all 52 videos? Any suggestions are appreciated.
P.S. I tried this post for infinite-scroll pages, but it didn't work for YouTube.
Current code:
url = 'https://www.youtube.com/results?search_query=border+collie&sp=EgIIAg%253D%253D'
driver = webdriver.Chrome()
driver.get(url)

# waiting for the page to load
sleep(3)

# repeat scrolling 10 times
for i in range(10):
    # scroll 1000 px
    driver.execute_script('window.scrollTo(0,(window.pageYOffset+1000))')
    sleep(3)

response = requests.get(url)
soup = bs(response.text, 'html.parser', from_encoding="UTF-8")

source_list = []
duration_list = []

# Scrape source of the video
vids_source = soup.findAll('div', attrs={'class': 'yt-lockup-byline'})
for i in vids_source:
    source = i.text
    source_list.append(source)

# Scrape video duration
vids_badge = soup.findAll('span', attrs={'class': 'video-time'})
for i in vids_badge:
    duration = i.text
    duration_list.append(duration)
I think you are confusing requests and Selenium. The requests module downloads and scrapes pages without using a browser at all, so it never sees what your browser scrolled into view. For your requirement, to scroll down and get more results, use Selenium alone and scrape the results with DOM locators such as XPath.
source_list = []
duration_list = []

for i in range(10):
    # scroll 1000 px
    driver.execute_script('window.scrollTo(0,(window.pageYOffset+1000))')
    sleep(3)
    elements = driver.find_elements_by_xpath('//div[@class = "yt-lockup-byline"]')
    for element in elements:
        source_list.append(element.text)
    elements = driver.find_elements_by_xpath('//span[@class = "video-time"]')
    for element in elements:
        duration_list.append(element.text)
So we scroll first and collect the element texts, scroll again and collect them again, and so on. There is no need to use requests when scraping like this.
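Note that reading the lists inside the scroll loop will pick up the same videos several times. A variant one could sketch scrolls until the page height stops growing and scrapes only once at the end; the height check and the single final scrape are my assumption, not part of the original answer:
last_height = driver.execute_script('return document.documentElement.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.documentElement.scrollHeight)')
    sleep(3)  # give the lazily loaded results time to appear
    new_height = driver.execute_script('return document.documentElement.scrollHeight')
    if new_height == last_height:  # nothing new was loaded, stop scrolling
        break
    last_height = new_height

# one pass over the fully loaded page, so nothing is collected twice
source_list = [e.text for e in driver.find_elements_by_xpath('//div[@class = "yt-lockup-byline"]')]
duration_list = [e.text for e in driver.find_elements_by_xpath('//span[@class = "video-time"]')]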

Can I get a URL that is generated by JavaScript, using Selenium and Python 3?

I am writing a parser using Selenium and Python 3.7 for the following site: https://www.oddsportal.com/soccer/germany/bundesliga/nurnberg-dortmund-fNa2KmU4/
Is it possible to get a URL that is generated by JavaScript, using Selenium in Python 3?
I need to get the URLs of the events on the sites from which the data in the table is taken.
For example, it seems to me that the data in the first row (10Bet) is obtained from this page: https://www.10bet.com/sports/football/germany-1-bundesliga/20190218/nurnberg-vs-dortmund/
How can I get the URL of this page?
To get the URLs of all the links on a page, you can store all the elements with tag name a in a WebElement list and then fetch the href attribute of each WebElement.
You can refer to the following code to collect every link present on the whole page:
List<WebElement> links = driver.findElements(By.tagName("a")); // This will store all the link WebElements into a list
for (WebElement ele : links) // This way you can take the URL of each link
{
    String url = ele.getAttribute("href"); // To get the link you can use getAttribute() method with "href" as an argument
    System.out.println(url);
}
In case you need a particular link explicitly, you need to pass the XPath of the element:
WebElement ele = driver.findElement(By.xpath("pass the XPath of the element here"));
and after that you need to do this:
String url = ele.getAttribute("href"); // to get the URL of that particular element
I have also shared a link with you so you can go and check; I have highlighted the elements in it.
Let us know if that helped or not.
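Since the question is tagged python-3.x, here is a rough Python equivalent of the Java snippet above, assuming driver is an already started Selenium WebDriver:
links = driver.find_elements_by_tag_name("a")  # every anchor element on the page
for ele in links:
    url = ele.get_attribute("href")  # read the href attribute of each link
    print(url)

# for one particular element, locate it by XPath instead
ele = driver.find_element_by_xpath("pass the XPath of the element here")
print(ele.get_attribute("href"))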
Try the code below, which prints the required URLs as per your requirement:
from selenium import webdriver

driver = webdriver.Chrome('C:\\NotBackedUp\\chromedriver.exe')
driver.maximize_window()
driver.get('https://www.oddsportal.com/soccer/germany/bundesliga/nurnberg-dortmund-fNa2KmU4/')

# Locators for locating the required URLs
xpath = "//div[@id='odds-data-table']//tbody//tr"
rows = "//div[@id='odds-data-table']//tbody//tr/td[1]"

print("=> Required URL's is/are : ")
# Fetching & printing the required URLs
for i in range(1, len(driver.find_elements_by_xpath(rows)), 1):
    anchor = driver.find_elements_by_xpath(xpath + "[" + str(i) + "]/td[2]/a")
    if len(anchor) > 0:
        print(anchor[0].get_attribute('href'))
    else:
        a = driver.find_elements_by_xpath("//div[@id='odds-data-table']//tbody//tr[" + (str(i + 1)) + "]/td[1]//a[2]")
        if len(a) > 0:
            print(a[0].get_attribute('href'))

print('Done...')
I hope it helps...

Is there a way to scroll down an Instagram page using a scraper?

This is a function I wrote to scrape the image URLs from my Instagram profile.
import re
import urllib.request as req

def ImageList():
    url = 'https://www.instagram.com/Username/?hl=en'
    data = req.Request(url)
    resp = req.urlopen(data)
    respData = resp.read()
    dat = re.findall(r'"src"\s*:\s*"(.+?)"', str(respData))
    print(str(respData))
    rec = []
    for x in dat:
        if re.search("/s640x640/", x):
            rec.append(x)
    return rec
Though it works quite well, it only returns the top 9 URLs or so. I realized this is because the page itself is an infinite-scroll page, so I need to scroll the page to load all the images and get their URLs.
Is there a way to do it without using drivers (Selenium WebDriver), i.e. by writing my own code for it?
I know Instagram has an API; the goal here is to make my code self-sufficient, so please don't bombard me with that, thank you.

Crawler skipping content of the first page

I've created a crawler which parses certain content from a website.
Firstly, it scrapes the category links from the left sidebar.
Secondly, it harvests all the links spread across the pagination that lead to the profile pages.
And finally, going to each profile page, it scrapes the name, phone and web address.
So far, it is doing well. The only problem I see with this crawler is that it always starts scraping from the second page, skipping the first page. I suppose there might be some way to get around this. Here is the complete code I am trying with:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def category_links(mainurl):
    req = requests.Session()
    response = req.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)  # links to the categories from the left sidebar

def next_pagelink(process_links):
    req = requests.Session()
    response = req.get(process_links).text
    tree = html.fromstring(response)
    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        profile_pagelink(link)  # all the links spread through the pagination, leading to the profile pages

def profile_pagelink(procured_links):
    req = requests.Session()
    response = req.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        target_pagelink(links)  # profile page of each link

def target_pagelink(main_links):
    req = requests.Session()
    response = req.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='container']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

category_links(url)
The problem with the first page is that it doesn't have a 'pagination' class, so the expression tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href") returns an empty list and the profile_pagelink function never gets executed.
As a quick fix, you can handle this case separately in the category_links function:
def category_links(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    if mainurl == "https://www.houzz.com/professionals/":
        profile_pagelink("https://www.houzz.com/professionals/")
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)
Also, I noticed that target_pagelink prints a lot of empty strings as a result of if_exist returning "". You can skip those cases by adding a condition in the for loop:
for titles in tree.xpath("//div[@class='container']"):  # use class='profile-cover' if you get duplicates
    name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
    phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
    web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
    if name + phone + web:
        print(name, phone, web)
Finally, requests.Session is mostly used for persisting cookies and other headers across requests, which is not necessary for your script. You can just use requests.get and get the same results.
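For instance, each helper could be reduced to a plain requests.get call; the get_tree helper below is only illustrative, not part of the original code:
def get_tree(url):
    # a plain GET is enough here; no cookies or headers need to persist between requests
    return html.fromstring(requests.get(url).text)

def category_links(mainurl):
    tree = get_tree(mainurl)
    for titles in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        next_pagelink(titles)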
