Scrape all YouTube search results - python-3.x

I am trying to collect data from YouTube search results. The search term is "border collie" with a filter for videos uploaded "Today".
52 videos appear in the search results. However, when I try to parse the page, I only get 20 entries. How do I parse all 52 videos? Any suggestions are appreciated.
P.S. I tried this post for infinite-scroll pages, but it didn't work for YouTube.
Current code:
from time import sleep

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver

url = 'https://www.youtube.com/results?search_query=border+collie&sp=EgIIAg%253D%253D'
driver = webdriver.Chrome()
driver.get(url)

# waiting for the page to load
sleep(3)

# repeat scrolling 10 times
for i in range(10):
    # scroll 1000 px
    driver.execute_script('window.scrollTo(0,(window.pageYOffset+1000))')
    sleep(3)

response = requests.get(url)
soup = bs(response.text, 'html.parser', from_encoding="UTF-8")

source_list = []
duration_list = []

# Scrape the source (channel) of each video
vids_source = soup.findAll('div', attrs={'class': 'yt-lockup-byline'})
for i in vids_source:
    source = i.text
    source_list.append(source)

# Scrape each video's duration
vids_badge = soup.findAll('span', attrs={'class': 'video-time'})
for i in vids_badge:
    duration = i.text
    duration_list.append(duration)

I think you are confusing requests and Selenium. The requests module downloads and scrapes pages without a browser, so it never sees what your scrolling loaded. For your requirement, to scroll down and get more results, use Selenium alone and scrape the results with DOM locators such as XPath.
source_list = []
duration_list = []

for i in range(10):
    # scroll 1000 px
    driver.execute_script('window.scrollTo(0,(window.pageYOffset+1000))')
    sleep(3)
    elements = driver.find_elements_by_xpath('//div[@class="yt-lockup-byline"]')
    for element in elements:
        source_list.append(element.text)
    elements = driver.find_elements_by_xpath('//span[@class="video-time"]')
    for element in elements:
        duration_list.append(element.text)
So we scroll first and collect the text of all the elements, scroll again and collect them again, and so on. There is no need for requests when scraping like this. Note that repeated passes pick up the same elements more than once, so you may want to deduplicate at the end, e.g. with the sketch below.
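A minimal deduplication sketch, assuming the two lists stay index-aligned (one byline per duration, appended in the same order on every pass):

# pair each source with its duration, then drop duplicate pairs while keeping order
pairs = list(dict.fromkeys(zip(source_list, duration_list)))
source_list = [s for s, _ in pairs]
duration_list = [d for _, d in pairs]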

Related

Beautiful Soup Value not extracting properly

Recently I was working with Python and Beautiful Soup to extract some data and put it into a pandas DataFrame.
I used Beautiful Soup to extract hotel data from the website booking.com.
I was able to extract most of the attributes correctly, without any empty values.
Here is my code snippet:
def get_Hotel_Facilities(soup):
    try:
        title = soup.find_all("div", attrs={"class": "db29ecfbe2 c21a2f2d97 fe87d598e8"})
        new_list = []
        # Inner NavigatableString Object
        for i in range(len(title)):
            new_list.append(title[i].text.strip())
    except AttributeError:
        new_list = ""
    return new_list
The above code is my function to retrieve the facilities of a hotel and return them as a list.
page_no = 0
d = {"Hotel_Name": [], "Hotel_Rating": [], "Room_type": [], "Room_price": [], "Room_sqft": [], "Facilities": [], "Location": []}

while (page_no <= 25):
    URL = f"https://www.booking.com/searchresults.html?aid=304142&label=gen173rf-1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ&sid=2214b1422694e7b065e28995af4e22d9&sb=1&sb_lp=1&src=theme_landing_index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fhotel%2Findex.html%3Faid%3D304142%26label%3Dgen173rf1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ%26sid%3D2214b1422694e7b065e28995af4e22d9%26&ss=goa&is_ski_area=0&checkin_year=2023&checkin_month=1&checkin_monthday=13&checkout_year=2023&checkout_month=1&checkout_monthday=14&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1&offset{page_no}"
    new_webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(new_webpage.content, "html.parser")
    links = soup.find_all("a", attrs={"class": "e13098a59f"})

    for link in links:
        new_webpage = requests.get(link.get('href'), headers=HEADERS)
        new_soup = BeautifulSoup(new_webpage.content, "html.parser")
        d["Hotel_Name"].append(get_Hotel_Name(new_soup))
        d["Hotel_Rating"].append(get_Hotel_Rating(new_soup))
        d["Room_type"].append(get_Room_type(new_soup))
        d["Room_price"].append(get_Price(new_soup))
        d["Room_sqft"].append(get_Room_Sqft(new_soup))
        d["Facilities"].append(get_Hotel_Facilities(new_soup))
        d["Location"].append(get_Hotel_Location(new_soup))

    page_no += 25
The above is the main code: the while loop traverses the result pages and collects the URLs of the hotel pages, then visits every page to retrieve the corresponding attributes.
I was able to retrieve the rest of the attributes correctly, but not the facilities: only some of the room facilities are returned and some are not.
Here is my output after putting it into a pandas DataFrame:
Facilities o/p image
Please help me with this problem, i.e. why some facilities come through and some do not.
P.S.: The facilities are available on the website.
I have tried using all the corresponding classes and attributes for retrieval, but I am not getting the facilities column properly.
Probably as a protective (anti-scraping) measure, the HTML fetched by requests doesn't seem to be consistent in its layout or even its contents.
There might be more possible selectors, but try:
def get_Hotel_Facilities(soup):
    selectors = ['div[data-testid="property-highlights"]', '#facilities',
                 '.hp-description~div div.important_facility']
    new_list = []
    for sel in selectors:
        for sect in soup.select(sel):
            new_list += list(sect.stripped_strings)
    return list(set(new_list))  # set <--> unique
But even with this, the results are inconsistent. E.g., I tested on this page with:
for i in range(10):
    soup = BeautifulSoup(cloudscraper.create_scraper().get(url).content)
    fl = get_Hotel_Facilities(soup) if soup else []
    print(f'[{i}] {len(fl)} facilities: {", ".join(fl)}')
(But the inconsistencies might be due to using cloudscraper - maybe you'll get better results with your headers?)
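If you stick with plain requests and your own headers, a minimal retry sketch along these lines may smooth over the inconsistent responses; URL and HEADERS are assumed to be the ones from your own script, and get_Hotel_Facilities is the selector-based helper above:

def fetch_facilities(url, headers, attempts=5):
    # Retry a few times and keep the longest facilities list seen,
    # since individual responses can come back with different layouts.
    best = []
    for _ in range(attempts):
        resp = requests.get(url, headers=headers)
        soup = BeautifulSoup(resp.content, "html.parser")
        facilities = get_Hotel_Facilities(soup)
        if len(facilities) > len(best):
            best = facilities
    return best

facilities = fetch_facilities(URL, HEADERS)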

Output from web scraping with bs4 returns empty lists

I am trying to scrape specific information from a website of 25 pages, but when I run my code I get empty lists. My output is supposed to be a dictionary with the scraped information. Any help would be appreciated.
# Loading libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import mitosheet

# Assigning column names using class_ names
name_selector = "af885_1iPzH"
old_price_selector = "f6eb3_1MyTu"
new_price_selector = "d7c0f_sJAqi"
discount_selector = "._6c244_q2qap"

# Placeholder list
data = []

# Looping over each page
for i in range(1, 26):
    url = "https://www.konga.com/category/phones-tablets-5294?brand=Samsung&page=" + str(i)
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    name = soup.select(name_selector)
    old_price = soup.select(old_price_selector)
    new_price = soup.select(new_price_selector)
    discount = soup.select(discount_selector)

    # Combining the elements into a zipped list to be able to pull the data simultaneously
    for names, old_prices, new_prices, discounts in zip(name, old_price, new_price, discount):
        dic = {"Phone Names": names.getText(), "New Prices": new_prices.getText(), "Old Prices": old_prices.getText(), "Discounts": discounts.getText()}
        data.append(dic)

data
I tested the code below and it works for me, getting 40 name values.
I wasn't able to get the values using Beautiful Soup, but I could get them directly through Selenium.
If you decide to use Chrome and PyCharm as I have, then:
Open Chrome. Click on the three dots near the top right. Click on Settings, then About Chrome, to see the version of your Chrome. Download the corresponding driver here. Save the driver in the PyCharm PATH folder.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Assigning column names using class_ names
name_selector = "af885_1iPzH"

# Looping over each page
for i in range(1, 27):
    url = "https://www.konga.com/category/phones-tablets-5294?brand=Samsung&page=" + str(i)
    driver.get(url)
    xPath = './/*[@class="' + name_selector + '"]'
    name = driver.find_elements(By.XPATH, xPath)
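As a rough follow-on sketch (assuming you want the text of the matched elements rather than the element handles, and reusing the driver and name_selector from above), you could collect the names across pages like this:

all_names = []  # names collected across all pages

for i in range(1, 27):
    url = "https://www.konga.com/category/phones-tablets-5294?brand=Samsung&page=" + str(i)
    driver.get(url)
    elements = driver.find_elements(By.XPATH, './/*[@class="' + name_selector + '"]')
    # keep the visible text of every matched element
    all_names.extend(element.text for element in elements)

print(len(all_names), "names collected")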

Not getting any data entry with 'find_all' while scraping Spotify Charts webpage

I am trying to scrape the Spotify charts page containing the top 200 songs in India on 2022-02-01. My Python code:
# It reads the webpage.
def get_webpage(link):
    page = requests.get(link)
    soup = bs(page.content, 'html.parser')
    return soup

# It collects the data for each country and writes them into a list.
# The entries are (in order): Song, Artist, Date, Play Count, Rank
def get_data():
    rows = []
    soup = get_webpage('https://spotifycharts.com/regional/in/daily/2022-02-01')
    entries = soup.find_all("td", class_="chart-table-track")
    streams = soup.find_all("td", class_="chart-table-streams")
    print(entries)
    for i, (entry, stream) in enumerate(zip(entries, streams)):
        song = entry.find('strong').get_text()
        artist = entry.find('span').get_text()[3:]
        play_count = stream.get_text()
        rows.append([song, artist, date, play_count, i + 1])  # `date` comes from the full script this was excerpted from
    return rows
I tried printing the entries and streams, but I get blank values from
entries = soup.find_all("td", class_="chart-table-track")
streams = soup.find_all("td", class_="chart-table-streams")
I have copied/referenced this from Here
and tried running the full script, but that gives the error 'NoneType' object has no attribute 'find_all' in the country function. Hence I tried the smaller section above.
NoneType suggests that it doesn't find the entries or streams; if you print soup, it will show you that the elements targeted by the entries and streams selectors do not exist.
After checking your soup object, it seems that Cloudflare is blocking your access to Spotify and you need to complete a CAPTCHA to get around this. There is a library set up for bypassing cloudflare called "cloudscraper".
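For example, a minimal sketch of swapping cloudscraper in for requests (using the chart URL from the question; whether the page still serves the same markup is a separate question):

import cloudscraper
from bs4 import BeautifulSoup

# cloudscraper exposes a requests-like session that attempts to pass Cloudflare's checks
scraper = cloudscraper.create_scraper()
page = scraper.get('https://spotifycharts.com/regional/in/daily/2022-02-01')
soup = BeautifulSoup(page.content, 'html.parser')

entries = soup.find_all("td", class_="chart-table-track")
streams = soup.find_all("td", class_="chart-table-streams")
print(len(entries), len(streams))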

How to change the page no. of a page's search result for Web Scraping using Python?

I am scraping data from a webpage which contains search results using Python.
I am able to scrape data from the 1st search result page.
I want to loop using the same code, changing the search result page with each loop cycle.
Is there any way to do it? Is there a way to click the 'Next' button without actually opening the page in a browser?
At a high level this is possible; you will need to use requests or Selenium in addition to BeautifulSoup.
Here is an example of locating an element and clicking the 'Next' button by XPath:
html = driver.page_source
soup = BeautifulSoup(html,'lxml')
sleep(1) # Time in seconds.
ele = driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[2]/div/table/tfoot/tr/td/div//button[contains(text(),'Next')]")
ele.click()
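From there, looping over the result pages is just a matter of repeating scrape-then-click until the button can no longer be found. A rough sketch, assuming the same driver and a simplified 'Next' locator (how the last page signals the end, e.g. by hiding or disabling the button, depends on the site):

from time import sleep
from bs4 import BeautifulSoup
from selenium.common.exceptions import NoSuchElementException

next_xpath = "//button[contains(text(),'Next')]"  # simplified locator; adjust to your page

while True:
    soup = BeautifulSoup(driver.page_source, 'lxml')
    # ... scrape the current result page from `soup` here ...
    try:
        driver.find_element_by_xpath(next_xpath).click()
    except NoSuchElementException:
        break  # no 'Next' button left, so this was the last page
    sleep(1)  # give the next page time to load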
Yes, of course you can do what you described. Although you didn't post the actual site, here is an example to help you get started.
import requests
from bs4 import BeautifulSoup

url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"

res = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(res.text, "lxml")

for page in range(7):
    # Rebuild the ASP.NET form data, overriding only the page-number field
    formdata = {}
    for item in soup.select("#aspnetForm input"):
        if "ctl00$Contenido$GoPag" in item.get("name"):
            formdata[item.get("name")] = page
        else:
            formdata[item.get("name")] = item.get("value")

    req = requests.post(url, data=formdata)
    soup = BeautifulSoup(req.text, "lxml")
    for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
        data = [item.get_text(strip=True) for item in items.select("td")]
        print(data)

How to get all the links on a page using Selenium and Python?

I have tried the code given below, but every time I run it some links end up missing. I want to get all the links on the page in a list, so that I can go to any link I want using slicing.
links = []
eles = driver.find_elements_by_xpath("//*[@href]")
for elem in eles:
    url = elem.get_attribute('href')
    print(url)
    links.append(url)
Is there any way to get all the elements without missing any?
Sometimes the links reside inside frames.
Search for frames in your website using the browser's Inspect tool.
In that case you need to switch to the frame first:
driver.switch_to.frame("x1")

links = []
eles = driver.find_elements_by_xpath("//*[@href]")
for elem in eles:
    url = elem.get_attribute('href')
    print(url)
    links.append(url)

driver.switch_to.default_content()
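If there are several frames, a rough sketch along these lines (same driver, nested frames not handled) visits each iframe in turn, collects its links, and returns to the main document:

links = []

# links in the main document
for elem in driver.find_elements_by_xpath("//*[@href]"):
    links.append(elem.get_attribute('href'))

# links inside each iframe
for frame in driver.find_elements_by_tag_name("iframe"):
    driver.switch_to.frame(frame)
    for elem in driver.find_elements_by_xpath("//*[@href]"):
        links.append(elem.get_attribute('href'))
    driver.switch_to.default_content()

print(len(links), "links found")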
