Web scraping with Beautiful Soup in Python

I want to crawl the homepage of YouTube to pull out all the links to videos. The following is the code:
from bs4 import BeautifulSoup
import requests

s = 'https://www.youtube.com/'
html = requests.get(s)
html = html.text
s = BeautifulSoup(html, features="html.parser")
for e in s.find_all('a', {'id': 'video-title'}):
    link = e.get('href')
    text = e.string
    print(text)
    print(link)
    print()
Nothing happens when I run the above code. It seems like the id is not getting discovered. What am I doing wrong?

It is because you are not getting the same HTML as your browser does.
import requests
from bs4 import BeautifulSoup
s = requests.get("https://youtube.com").text
soup = BeautifulSoup(s,'lxml')
print(soup)
Save this code's output to a file named test.html and open it in a browser. You will see that it is not the same as the browser's page, as it looks incomplete.
See these questions below.
HTML in browser doesn't correspond to scraped data in python
Python requests not giving me the same HTML as my browser is
Basically, I recommend you use Selenium WebDriver, as it behaves like a real browser.
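To see the difference concretely, here is a small sketch (the HTML snippets are invented stand-ins, not real YouTube markup) showing why the `id="video-title"` selector finds nothing in the raw response but works on a JavaScript-rendered DOM:

```python
from bs4 import BeautifulSoup

# Invented stand-ins, not real YouTube markup: the raw response holds only a
# script that would build the page client-side; the rendered DOM has the anchor.
raw_html = '<html><body><script>/* video data rendered client-side */</script></body></html>'
rendered_html = '<html><body><a id="video-title" href="/watch?v=abc123">Some video</a></body></html>'

raw_links = BeautifulSoup(raw_html, 'html.parser').find_all('a', {'id': 'video-title'})
rendered_links = BeautifulSoup(rendered_html, 'html.parser').find_all('a', {'id': 'video-title'})

print(len(raw_links))       # 0 -- nothing to find in the raw response
print(len(rendered_links))  # 1 -- works once JavaScript has built the DOM
```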

Yes, this is a strange page to scrape, but if you scrape at the 'div id="content"' level, you are able to get the data you are requesting. I was able to get the title of each video, but it appears YouTube has some rate limiting or throttling, so I do not think you will be able to get ALL of the titles and links. At any rate, below is what I got working for the titles:
import requests
from bs4 import BeautifulSoup
url = 'https://www.youtube.com/'
response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, 'html.parser')
links = soup.find_all('div', id='content')
for each in links:
    print(each.text)
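As a side note on this approach, `.text` on a large container runs all the child strings together, while `get_text()` with a separator keeps them readable. A minimal sketch on an invented inline sample:

```python
from bs4 import BeautifulSoup

# Invented inline sample: a container whose children each hold one title.
html = '<div id="content"><h3>First video</h3><h3>Second video</h3></div>'
block = BeautifulSoup(html, 'html.parser').find('div', id='content')

print(block.text)  # First videoSecond video -- strings run together
# get_text with a separator keeps the pieces distinct:
print(block.get_text(separator=' | ', strip=True))  # First video | Second video
```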

Maybe this could help for scraping all videos from the YouTube home page:
from bs4 import BeautifulSoup
import requests

r = 'https://www.youtube.com/'
html = requests.get(r)
all_videos = []
soup = BeautifulSoup(html.text, 'html.parser')
for i in soup.find_all('a'):
    if i.has_attr('href'):
        text = i.attrs.get('href')
        if text.startswith('/watch?'):
            # strip the trailing slash so the joined URL has no double slash
            url = r.rstrip('/') + text
            all_videos.append(url)
print('Total Videos', len(all_videos))
print('LIST OF VIDEOS', all_videos)

This code snippet selects all links on the youtube.com homepage that contain /watch? in their href attribute (links to videos):
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get('https://www.youtube.com/').text, 'lxml')
for a in soup.select('a[href*="/watch?"]'):
    print('https://www.youtube.com{}'.format(a['href']))
Prints:
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=pBhkG2Zwf-c
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=gnn7GwqXek4
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=AMKDVfucPfA
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=daQcqPHx9uw
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=V_MXGdSBbAI
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=KEW9U7s_zks
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=EM7ZR5z3kCo
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=6NPHk-Yd4VU
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=dHiAls8loz4
https://www.youtube.com/watch?v=2_mDOWLhkVU
https://www.youtube.com/watch?v=2_mDOWLhkVU
...and so on
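The doubled lines happen because the homepage typically links each video twice (thumbnail and title share one href). If that matters, an order-preserving dedupe is a one-liner; a sketch against an invented inline sample:

```python
from bs4 import BeautifulSoup

# Invented inline sample mimicking the doubled links (thumbnail + title share one href).
html = '''
<a href="/watch?v=pBhkG2Zwf-c"><img src="thumb1.jpg"></a>
<a href="/watch?v=pBhkG2Zwf-c">Title one</a>
<a href="/watch?v=gnn7GwqXek4"><img src="thumb2.jpg"></a>
<a href="/watch?v=gnn7GwqXek4">Title two</a>
'''

soup = BeautifulSoup(html, 'html.parser')
# dict.fromkeys keeps first-seen order while dropping duplicate hrefs
unique = list(dict.fromkeys(
    'https://www.youtube.com' + a['href']
    for a in soup.select('a[href*="/watch?"]')
))
print(unique)
# ['https://www.youtube.com/watch?v=pBhkG2Zwf-c', 'https://www.youtube.com/watch?v=gnn7GwqXek4']
```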

Related

How can I scrape a <h1> tag using BeautifulSoup? [Python]

I am currently coding a price tracker for different websites, but I have run into an issue.
I'm trying to scrape the contents of an <h1> tag using BeautifulSoup4, but I don't know how. I've tried to use a dictionary, as suggested in
https://stackoverflow.com/a/40716482/14003061, but it returned None.
Can someone please help? It would be appreciated!
Here's the code:
from termcolor import colored
import requests
from bs4 import BeautifulSoup
import smtplib
def choice_bwfo():
    print(colored("You have selected Buy Whole Foods Online [BWFO]", "blue"))
    url = input(colored("\n[ 2 ] Paste a product link from BWFO.\n", "magenta"))
    url_verify = requests.get(url, headers=headers)
    soup = BeautifulSoup(url_verify.content, 'html5lib')
    item_block = BeautifulSoup.find('h1', {'itemprop' : 'name'})
    print(item_block)
choice_bwfo()
here's an example URL you can use:
https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html
Thanks :)
This script will print the content of the <h1> tag:
import requests
from bs4 import BeautifulSoup
url = 'https://www.buywholefoodsonline.co.uk/organic-spanish-bee-pollen-250g.html'
# create `soup` variable from the URL:
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# print text of first `<h1>` tag:
print(soup.h1.get_text())
Prints:
Organic Spanish Bee Pollen 250g
Or you can do:
print(soup.find('h1', {'itemprop' : 'name'}).get_text())
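For what it's worth, one likely reason the question's snippet failed is that it called `find` on the `BeautifulSoup` class itself rather than on the `soup` instance (and `headers` was never defined). A minimal sketch of the corrected call, using an invented inline sample in place of the live page:

```python
from bs4 import BeautifulSoup

# Invented inline sample standing in for the product page (real markup may differ).
html = '<h1 itemprop="name" class="product-title">Organic Spanish Bee Pollen 250g</h1>'
soup = BeautifulSoup(html, 'html.parser')

# The question's snippet wrote BeautifulSoup.find(...); find must be called
# on the soup instance that actually holds the parsed document.
item_block = soup.find('h1', {'itemprop': 'name'})
print(item_block.get_text())  # Organic Spanish Bee Pollen 250g
```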

BeautifulSoup doesn't catch the complete link

When I try to get the links on a web page, bs4 doesn't capture the entire link; it stops before the ?ref_ part.
I'll explain the question through the code:
import requests
from bs4 import BeautifulSoup

imdb_link = "https://www.imdb.com/chart/top?ref_=nv_mv_250"
site = requests.get(imdb_link)
soup = BeautifulSoup(site.text, 'lxml')
for items in soup.find("table", class_="chart").find_all(class_="titleColumn"):
    link = items.find("a").get('href')
    print(link)
The output is:
/title/tt0111161/
/title/tt0068646/
/title/tt0071562/
/title/tt0468569/
/title/tt0050083/
/title/tt0108052/
/title/tt0167260/
...and so on..
But that's incomplete, as you can see by inspecting the web page; it should be:
/title/tt0111161/?ref_=adv_li_tt
/title/tt0068646/?ref_=adv_li_tt
...and so on...
How can I get the entire link, including the ?ref_=adv_li_tt part?
I am using Python 3.7.4.
Overall it might be interesting to work out how to get the full link (I think you would need selenium for that, to allow JavaScript to run on the page), but you don't need the full link as seen on the rendered page. What you have, with the prefix https://www.imdb.com added, is perfectly serviceable.
import requests
from bs4 import BeautifulSoup as bs
with requests.Session() as s:
    r = s.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
    soup = bs(r.content, 'lxml')
    links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
You could let selenium load the page so the content renders, then pass it over to bs4 to get the links as they appear on the page:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.imdb.com/chart/top?ref_=nv_mv_25')
soup = bs(d.page_source, 'lxml')
d.quit()
links = ['https://www.imdb.com' + i['href'] for i in soup.select('.titleColumn a')]
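If you only want links that match the rendered page, note that ?ref_=adv_li_tt is just a query string, so it can be appended to the scraped paths directly (assuming, as the answer suggests, that ref_ is a tracking parameter that doesn't change the page content):

```python
from urllib.parse import urlencode

# Appending the tracking query string to the scraped paths directly.
# (Assumption: ref_ only tracks where the click came from.)
scraped = ['/title/tt0111161/', '/title/tt0068646/']
full = ['https://www.imdb.com' + path + '?' + urlencode({'ref_': 'adv_li_tt'})
        for path in scraped]
print(full[0])  # https://www.imdb.com/title/tt0111161/?ref_=adv_li_tt
```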

Results of soup.find are None despite the content existing

I'm trying to track the price of a product on Amazon using Python in a Jupyter notebook. I've imported bs4 and requests for this task.
When I inspect the HTML on the product page I can see <span id="productTitle" class="a-size-large">.
However, when I try to search for it using soup.find(id="productTitle"), the result comes out as None.
I've tried soup.find with other ids and classes, but the results are still None.
title = soup.find(id="productTitle")
This is my code to find the id.
If I fix this, I hope to be able to get the name of the product whose price I will be tracking.
That info is stored in various places in the returned HTML. Have you checked your response to make sure you are not blocked or getting an unexpected response?
I found it with that id, using .strip() to tidy the text:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/')
soup = bs(r.content, 'lxml')
print(soup.select_one('#productTitle').text.strip())
Also,
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.amazon.com/dp/B00M4LWO8O/')
soup = bs(r.content, 'lxml')
print(soup.select_one('#imgTagWrapperId img[alt]')['alt'])
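More generally, select_one returns None when the selector misses (for example on a blocked or CAPTCHA response), so it is worth guarding before calling .text. A sketch with an invented stand-in for a blocked page:

```python
from bs4 import BeautifulSoup

# Invented stand-in for a blocked/CAPTCHA response lacking the expected element.
blocked_html = '<html><body><p>Robot check</p></body></html>'
soup = BeautifulSoup(blocked_html, 'html.parser')

node = soup.select_one('#productTitle')
# select_one returns None on a miss, so guard before dereferencing .text
title = node.text.strip() if node else None
print(title)  # None -- a sign the request was blocked or the markup changed
```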

BeautifulSoup returns urls of pages on same website shortened

My code for reference:
import httplib2
from bs4 import BeautifulSoup

h = httplib2.Http('.cache')
response, content = h.request('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html')
soup = BeautifulSoup(content, "lxml")
urls = []
for tag in soup.findAll('a', href=True):
    urls.append(tag['href'])
responses = []
contents = []
for url in urls:
    try:
        response1, content1 = h.request(url)
        responses.append(response1)
        contents.append(content1)
    except:
        pass
The idea is, I get the payload of a webpage, and then scrape that for hyperlinks. One of the links is to yahoo.com, the other to 'http://csb.stanford.edu/class/public/index.html'
However the result I'm getting from BeautifulSoup is:
>>> urls
['http://www.yahoo.com/', '../../index.html']
This presents a problem, because the second part of the script cannot be executed on the second, shortened url. Is there any way to make BeautifulSoup retrieve the full url?
That's because the link on the webpage is actually of that form. The HTML from the page is:
<p>Or let's just link to <a href=../../index.html>another page on this server</a></p>
This is called a relative link.
To convert this to an absolute link, you can use urljoin from the standard library.
from urllib.parse import urljoin  # Python 3
urljoin('http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html',
        '../../index.html')
# returns 'http://csb.stanford.edu/class/public/index.html'
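Applied to the scraped list, urljoin also leaves absolute URLs untouched, so it can be used uniformly on every href:

```python
from urllib.parse import urljoin

base = 'http://csb.stanford.edu/class/public/pages/sykes_webdesign/05_simple.html'
urls = ['http://www.yahoo.com/', '../../index.html']

# urljoin resolves relative hrefs against the page URL and passes
# absolute URLs through unchanged.
resolved = [urljoin(base, u) for u in urls]
print(resolved)
# ['http://www.yahoo.com/', 'http://csb.stanford.edu/class/public/index.html']
```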

Asynchronous scraping with Python: grequests and Beautifulsoup4

I am trying to scrape this site. I managed to do it using urllib and BeautifulSoup, but urllib is too slow. I want to make asynchronous requests because there are thousands of URLs. I found that grequests is a nice package for this.
example:
import grequests
from bs4 import BeautifulSoup

pages = []
page = "https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
for i in range(1, 1000):
    pages.append(page)
    page = "https://www.spitogatos.gr/search/results/residential/sale/r100/m100m101m102m103m104m105m106m107m108m109m110m150m151m152m153m154m155m156m157m158m159m160m161m162m163m164m165m166m167m168m169m170m171m172m173m174m175m176m177m178m179m180m181m182m183m184m185m186m187m188m189m190m191m192m193m194m195m196m197m198m106001m125000m"
    page = page + "/offset_{}".format(i * 10)
rs = (grequests.get(item) for item in pages)
a = grequests.map(rs)
The problem is that I don't know how to continue and use BeautifulSoup so as to get the HTML code of every page.
It would be nice to hear your ideas. Thank you!
Refer to the script below, and also check the linked source. It should help.
reqs = (grequests.get(link) for link in links)
# imap takes the concurrency as the size keyword, not a Pool object
resp = grequests.imap(reqs, size=10)
for r in resp:
    soup = BeautifulSoup(r.text, 'lxml')
    results = soup.find_all('a', attrs={"class": 'product__list-name'})
    print(results[0].text)
    prices = soup.find_all('span', attrs={'class': "pdpPriceMrp"})
    print(prices[0].text)
    discount = soup.find_all("div", attrs={"class": "listingDiscnt"})
    print(discount[0].text)
Source: https://blog.datahut.co/asynchronous-web-scraping-using-python/
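If grequests is not available, the same fan-out-then-parse pattern works with only the standard library. A sketch with a hypothetical stand-in `fetch` function so it runs offline; swap in `requests.get(url).text` for real use:

```python
from concurrent.futures import ThreadPoolExecutor
from bs4 import BeautifulSoup

# Hypothetical stand-in so the sketch runs offline;
# replace with requests.get(url).text for real scraping.
def fetch(url):
    return '<html><title>page for {}</title></html>'.format(url)

pages = ['https://example.com/offset_{}'.format(i * 10) for i in range(5)]

# Threads fan the fetches out; map returns the bodies in input order.
with ThreadPoolExecutor(max_workers=10) as pool:
    bodies = list(pool.map(fetch, pages))

# From here it is ordinary BeautifulSoup work on each body:
titles = [BeautifulSoup(b, 'html.parser').title.text for b in bodies]
print(titles[0])  # page for https://example.com/offset_0
```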
