Scraping using Beautifulsoup, extracting text - python-3.x

I am trying to extract " ₹ 75" from below line.
<td class="srpTuple__midGrid title_semiBold srpTuple__spacer16 " id="srp_tuple_price">₹ 75
I am trying to extract " ₹ 75". Can anyone please help? :)

You can try it to get the price ₹ 75:
For single td element:
html_doc = """<td class="srpTuple__midGrid title_semiBold srpTuple__spacer16 " id="srp_tuple_price">₹ 75</td"""
soup = BeautifulSoup(html_doc, 'lxml')
price = price = soup.find('td', id="srp_tuple_price").text
print(price)
For multipule td element:
html_doc = """<td class="srpTuple__midGrid title_semiBold srpTuple__spacer16 " id="srp_tuple_price">₹ 75</td"""
soup = BeautifulSoup(html_doc, 'lxml')
prices = soup.find_all('td', id="srp_tuple_price")
for price in prices:
print(price.text)

This will extract the value and compare it against a variable,
srp_tuple_price = '₹ 75'
html = '<td class="srpTuple__midGrid title_semiBold srpTuple__spacer16 " id="srp_tuple_price">₹ 75</td>'
soup = BeautifulSoup(html,features='html.parser')
rows=soup.findAll('td')
for row in rows:
if row.text==srp_tuple_price:
print('Success')
else:
print ('fail')
Output:
Success

Related

For Loop to CSV Leading to Uneven Rows in Python

Still learning Python, so apologies if this is an extremely obvious mistake. I've been trying to figure it out for hours now though and figured I'd see if anyone can help out.
I've scraped a hockey website for their ice skate name and price and have written it to a CSV. The only problem is that when I write it to CSV the rows for the name column (listed as Gear) and the Price column are not aligned. It goes:
Gear Name 1
Row Space
Price
Row Space
Gear Name 2
It would be great to align the gear and price rows next to each other. I've attached a link to a picture of the CSV as well if that helps.
import requests
from bs4 import BeautifulSoup as Soup
webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')
filename = "gear.csv"
f = open(filename, "w")
headers = "Gear, Price"
f.write(headers)
for gear in parser.find_all("div", {"class": "details"}):
gearname = gear.find_all("div", {"class": "name"}, "a")
gearnametext = gearname[0].text
gearprice = gear.find_all("div", {"class": "price"}, "a")
gearpricetext = gearprice[0].text
print (gearnametext)
print (gearpricetext)
f.write(gearnametext + "," + gearpricetext)
[What the uneven rows look like][1]
[1]: https://i.stack.imgur.com/EG2f2.png
Would recommend with python 3 to use with open(filename, 'w') as f: and strip() your texts before write() to your file.
Unless you do not use 'a' mode to append each line you have to add linebreak to each line you are writing.
Example
import requests
from bs4 import BeautifulSoup as Soup
webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')
filename = "gear1.csv"
headers = "Gear,Price\n"
with open(filename, 'w') as f:
f.write(headers)
for gear in parser.find_all("div", {"class": "details"}):
gearnametext = gear.find("div", {"class": "name"}).text.strip()
gearpricetext = gear.find("div", {"class": "price"}).text.strip()
f.write(gearnametext + "," + gearpricetext+"\n")
Output
Gear,Price
Bauer Vapor X3.7 Ice Hockey Skates - Senior,$249.99
Bauer X-LP Ice Hockey Skates - Senior,$119.99
Bauer Vapor Hyperlite Ice Hockey Skates - Senior,$999.98 - $1149.98
CCM Jetspeed FT475 Ice Hockey Skates - Senior,$249.99
Bauer X-LP Ice Hockey Skates - Intermediate,$109.99
...
I've noticed that gearnametext returns 2\n inside the string. You should try the method str.replace() to remove the \n which are creating you the jump to the next line. Try with:
import requests
from bs4 import BeautifulSoup as Soup
webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')
filename = "gear.csv"
f = open(filename, "w")
headers = "Gear, Price"
f.write(headers)
for gear in parser.find_all("div", {"class": "details"}):
gearname = gear.find_all("div", {"class": "name"}, "a")
gearnametext = gearname[0].text.replace('\n','')
gearprice = gear.find_all("div", {"class": "price"}, "a")
gearpricetext = gearprice[0].text
print (gearnametext)
print (gearpricetext)
f.write(gearnametext + "," + gearpricetext)
I changed inside the loop the second line for the gear name for: gearnametext = gearname[0].text.replace('\n','').

Unable to scrape all data using beautiful soup

URL = r"https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/"
My_list = ['2007','2008','2009','2010']
Year = []
CompanyName = []
Rank = []
Score = []
for I, Page in enumerate(My_list, start=1):
url = r'https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/{}'.format(Page)
print(url)
Res = requests.get(url)
soup = BeautifulSoup(Res.content , 'html.parser')
data = soup.find('div' ,{'id':'main-content'})
for Data in data:
Title = data.findAll('h3')
for title in Title:
CompanyName.append(title.text.strip())
Rank = data.findAll('div' ,{'class':'rank RankNumber'})
for rank in Rank:
Rank.append(rank)
Score = data.findAll('div' ,{'class':'rank RankNumber'})
for score in Score:
Score.append(score)
I am unable to get all data for title ,Rank ,Score.
I dont know whether i have identified the right tag . and iam unble to extract value from the list rank.
To get you started. First, find all the div.RankItem elements, then within each, find the title, rank, and score.
from bs4 import BeautifulSoup
import requests
resp = requests.get('https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/2010')
soup = BeautifulSoup(resp.content , 'html.parser')
for i, item in enumerate(soup.find_all("div", {"class": "RankItem"})):
title = item.find("h3", {"class": "MainLink"}).get_text().strip()
rank = item.find("div", {"class": "RankNumber"}).get_text().strip()
score = item.find("div", {"class": "score"}).get_text().strip()
print(i+1, title, rank, score)

Unable to scrape all data

from bs4 import BeautifulSoup
import requests , sys ,os
import pandas as pd
URL = r"https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/"
My_list = ['2007','2008','2009','2010','2011','2012','2013','2014','2015','2016','2017','2018','2019','2020']
Year= []
CompanyName = []
Rank = []
Score = []
print('\n>>Process started please wait\n\n')
for I, Page in enumerate(My_list, start=1):
url = r'https://www.vault.com/best-companies-to-work-for/law/top-100-law-firms-rankings/year/{}'.format(Page)
print('\nData fetching from : ',url)
Res = requests.get(url)
soup = BeautifulSoup(Res.content , 'html.parser')
data = soup.find('section',{'class': 'search-result CompanyWorkfor RankingMain FindSchools school-results contrastSection d-flex justify-content-center min-height Rankings CompRank'})
if len(soup) > 0:
print("\n>>Getting page source for :" , url)
else:
print("Please Check url :",url)
for i, item in enumerate(data.find_all("div", {"class": "RankItem"})):
year = item.find("i",{"class":"fa-stack fa-2x"})
Year.append(year)
title = item.find("h3", {"class": "MainLink"}).get_text().strip()
CompanyName.append(title)
rank = item.find("div", {"class": "RankNumber"}).get_text().strip()
Rank.append(rank)
score = item.find("div", {"class": "score"}).get_text().strip()
Score.append(score)
Data = pd.DataFrame({"Year":Year,"CompanyName":CompanyName,"Rank":Rank,"Score":Score})
Data[['First','Score']] = Data.Score.str.split(" " , expand =True,)
Data[['hash','Rank']] = Data.Rank.str.split("#" , expand = True,)
Data.drop(columns = ['hash','First'],inplace = True)
Data.to_csv('Vault_scrap.csv',index = False)
For each url the expected output Data for year, rank, title and score is 100 lines, but I'm getting only 10 lines.
You can iterate through the year and pages like this.
import requests
import pandas as pd
url = 'https://www.vault.com/vault/api/Rankings/LoadMoreCompanyRanksJSON'
def page_loop(year, url):
tableReturn = pd.DataFrame()
for page in range(1,101):
payload = {
'rank': '2',
'year': year,
'category': 'LBACCompany',
'pg': page}
jsonData = requests.get(url, params=payload).json()
if jsonData == []:
return tableReturn
else:
print ('page: %s' %page)
tableReturn = tableReturn.append(pd.DataFrame(jsonData), sort=True).reset_index(drop=True)
return tableReturn
results = pd.DataFrame()
for year in range(2007,2021):
print ("\n>>Getting page source for :" , year)
jsonData = page_loop(year, url)
results = results.append(pd.DataFrame(jsonData), sort=True).reset_index(drop=True)

how to access bestbuy item price

I want to check the price of a item from bestbuy website, however, the access is denied. Does anyone have some advice how to access? Thanks!
My code:
import requests
import bs4 as bs
url = "https://www.bestbuy.com/site/lg-65-class-oled-b9-series-2160p-smart-4k-uhd-tv-with-hdr/6360611.p?skuId=6360611"
url_get = requests.get(url)
soup = bs.BeautifulSoup(url_get.content, 'lxml')
with open('url_bestbuy.txt', 'w', encoding='utf-8') as f_out:
f_out.write(soup.prettify())
js_test = soup.find('span', id ='priceblock_ourprice')
if js_test is None:
js_test = soup.find('span', id ='div.price-block')
str = ""
for line in js_test.stripped_strings :
str = line
# convert to integer
str = str.replace(", ", "")
str = str.replace("$", "")
current_price = int(float(str))
your_price = 2000
if current_price < your_price :
print("I can afford it")
else:
print("Price is high please wait for the best deal")
You don't have permission to access "http://www.bestbuy.com/site/lg-65-class-oled-b9-series-2160p-smart-4k-uhd-tv-with-hdr/6360611.p?" on this server.

How to extract tables from different pages? (python)

I want to extract the tables of first serval pages on http://
The tables have been scraped by the code below and they are in a list,
import urllib
from bs4 import BeautifulSoup
base_url = "http://"
url_list = ["{}?page={}".format(base_url, str(page)) for page in range(1, 21)]
mega = []
for url in url_list:
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', {'class': 'table table-bordered table-striped table-hover'})
mega.append(table)
Because it is a list and cannot use 'soup find_all' to extract the items I want so I converted them into bs4.element.Tag to further serach the items
for i in mega:
trs = table.find_all('tr')[1:]
rows = list()
for tr in trs:
rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip().rstrip() for td in tr.find_all('td')])
rows
The rows only extract the tables of last page. What is the problem of my codes so the previous 19 tables are not been extracted? Thanks!
The length of the two items are not equivalent.I used for i in meaga to obetain i.
len(mega) = 20
len(i) = 5
The problem is pretty simple. In this for loop:
for i in mega:
trs = table.find_all('tr')[1:]
rows = list()
for tr in trs:
rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip().rstrip() for td in tr.find_all('td')])
You initialize rows = list() in the for loop. So you loop 21 times, but you also empty the list 20 times.
So you need to have it like this:
rows = list()
for i in mega:
trs = table.find_all('tr')[1:]
for tr in trs:
rows.append([td.text.replace('\n', '').replace('\xa0', '').replace('\t', '').strip().rstrip() for td in tr.find_all('td')])

Resources